mahout-user mailing list archives

From sapan.par...@eclinicalworks.com
Subject Simple Mahout classification example
Date Thu, 21 Feb 2013 09:57:46 GMT


I want to train a classification model. My text comes from a database, and I really do not
want to write it to files just for Mahout training. I checked out the MIA (Mahout in Action)
source code and modified the following code for a very basic training task. The usual problem
with Mahout examples is that they either show how to run Mahout from the command prompt on
the 20 Newsgroups data set, or the code depends heavily on Hadoop, ZooKeeper, etc. I would
really appreciate it if someone could look at my code, or point me to a very simple tutorial
that shows how to train a model and then use it.
As of now, the following code never gets past if (best != null), because
learningAlgorithm.getBest() always returns null!
Sorry for posting the whole code, but I didn't see any other option.
 
// Imports assume a Mahout 0.7-style package layout plus Guava; adjust for other versions.
import java.util.Iterator;
import java.util.List;

import com.google.common.collect.LinkedListMultimap;
import com.google.common.collect.ListMultimap;
import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
import org.apache.mahout.classifier.sgd.CrossFoldLearner;
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.ModelSerializer;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.ep.State;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.function.DoubleFunction;
import org.apache.mahout.math.function.Functions;
import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
import org.apache.mahout.vectorizer.encoders.Dictionary;
import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder;
import org.apache.mahout.vectorizer.encoders.TextValueEncoder;

public class Classifier {
    private static final int FEATURES = 10000;
    private static final TextValueEncoder encoder = new TextValueEncoder("body");
    private static final FeatureVectorEncoder bias = new ConstantValueEncoder("Intercept");
    private static final String[] LEAK_LABELS = {"none", "month-year", "day-month-year"};

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws Exception {
        int leakType = 0;
        // TODO code application logic here
        AdaptiveLogisticRegression learningAlgorithm =
                new AdaptiveLogisticRegression(20, FEATURES, new L1());
        Dictionary newsGroups = new Dictionary();
        //ModelDissector md = new ModelDissector();

        // In-memory training data: label -> example texts
        ListMultimap<String, String> noteBySection = LinkedListMultimap.create();
        noteBySection.put("good",
                "I love this product, the screen is a pleasure to work with and is a great choice for any business");
        noteBySection.put("good",
                "What a product!! Really amazing clarity and works pretty well");
        noteBySection.put("good",
                "This product has good battery life and is a little bit heavy but I like it");
        noteBySection.put("bad",
                "I am really bored with the same UI, this is their 5th version(or fourth or sixth, who knows) and it looks just like the first one");
        noteBySection.put("bad",
                "The phone is bulky and useless");
        noteBySection.put("bad",
                "I wish i had never bought this laptop. It died in the first year and now i am not able to return it");

        encoder.setProbes(2);
        double step = 0;
        int[] bumps = {1, 2, 5};
        double averageCorrect = 0;
        double averageLL = 0;
        int k = 0;
        //-------------------------------------
        //notes.keySet()
        for (String key : noteBySection.keySet()) {
            System.out.println(key);
            List<String> notes = noteBySection.get(key);
            for (Iterator<String> it = notes.iterator(); it.hasNext();) {
                String note = it.next();
                int actual = newsGroups.intern(key);
                Vector v = encodeFeatureVector(note);
                learningAlgorithm.train(actual, v);
                k++;

                // Progress is reported when k is a multiple of bump * scale
                int bump = bumps[(int) Math.floor(step) % bumps.length];
                int scale = (int) Math.pow(10, Math.floor(step / bumps.length));

                State<AdaptiveLogisticRegression.Wrapper, CrossFoldLearner> best =
                        learningAlgorithm.getBest();
                double maxBeta;
                double nonZeros;
                double positive;
                double norm;
                double lambda = 0;
                double mu = 0;
                if (best != null) {
                    CrossFoldLearner state = best.getPayload().getLearner();
                    averageCorrect = state.percentCorrect();
                    averageLL = state.logLikelihood();
                    OnlineLogisticRegression model = state.getModels().get(0);
                    // finish off pending regularization
                    model.close();
                    Matrix beta = model.getBeta();
                    maxBeta = beta.aggregate(Functions.MAX, Functions.ABS);
                    // count coefficients that are not (numerically) zero
                    nonZeros = beta.aggregate(Functions.PLUS, new DoubleFunction() {
                        @Override
                        public double apply(double v) {
                            return Math.abs(v) > 1.0e-6 ? 1 : 0;
                        }
                    });
                    // count strictly positive coefficients
                    positive = beta.aggregate(Functions.PLUS, new DoubleFunction() {
                        @Override
                        public double apply(double v) {
                            return v > 0 ? 1 : 0;
                        }
                    });
                    norm = beta.aggregate(Functions.PLUS, Functions.ABS);
                    lambda = learningAlgorithm.getBest().getMappedParams()[0];
                    mu = learningAlgorithm.getBest().getMappedParams()[1];
                } else {
                    maxBeta = 0;
                    nonZeros = 0;
                    positive = 0;
                    norm = 0;
                }

                System.out.println(k % (bump * scale));
                if (k % (bump * scale) == 0) {
                    if (learningAlgorithm.getBest() != null) {
                        System.out.println("----------------------------");
                        ModelSerializer.writeBinary("c:/tmp/news-group-" + k + ".model",
                                learningAlgorithm.getBest().getPayload().getLearner().getModels().get(0));
                    }
                    step += 0.25;
                    System.out.printf("%.2f\t%.2f\t%.2f\t%.2f\t%.8g\t%.8g\t",
                            maxBeta, nonZeros, positive, norm, lambda, mu);
                    System.out.printf("%d\t%.3f\t%.2f\t%s\n",
                            k, averageLL, averageCorrect * 100, LEAK_LABELS[leakType % 3]);
                }
            }
        }
        learningAlgorithm.close();
    }

    // Hashes the lower-cased text plus a constant bias term into a sparse feature vector
    private static Vector encodeFeatureVector(String text) {
        encoder.addText(text.toLowerCase());
        //System.out.println(encoder.asString(text));
        Vector v = new RandomAccessSparseVector(FEATURES);
        bias.addToVector((byte[]) null, 1, v);
        encoder.flush(1, v);
        return v;
    }
}
 
 
Sapankumar Parikh
 Product Development
 
eClinicalWorks
2 Technology Drive | Westborough, MA 01581
T: 5084750450 x 17269
sapan.parikh@eclinicalworks.com | www.eclinicalworks.com

This transmission contains confidential information belonging to the sender that is legally
privileged and proprietary and may be subject to protection under the law, including the Health
Insurance Portability and Accountability Act (HIPAA). If you are not the intended recipient
of this e-mail, you are prohibited from sharing, copying, or otherwise using or disclosing
its contents. If you have received this e-mail in error, please notify the sender immediately
by reply e-mail and permanently delete this e-mail and any attachments without reading, forwarding,
or saving them. Thank you.