mahout-user mailing list archives

From Brian McCallister <bri...@skife.org>
Subject Help with Classifier
Date Wed, 13 Feb 2013 23:29:55 GMT
I'm trying to build a basic two-category classifier on textual data. I'm
working with a training set of only about 100,000 documents, and am using
an AdaptiveLogisticRegression with default settings.

When I train the model, it reports:

% correct:      0.9996315789473774
AUC:            0.75
log likelihood: -0.032966543010819874

Which seems pretty good.

When I then classify the *training data* itself, everything lands in the
first category, when in fact the documents are split roughly down the middle.

Creation of vectors looks like:

        // hashed encoder for the document content; one feature per token
        FeatureVectorEncoder content_encoder = new AdaptiveWordValueEncoder("content");
        content_encoder.setProbes(2);

        // single categorical feature for the document type
        FeatureVectorEncoder type_encoder = new StaticWordValueEncoder("type");
        type_encoder.setProbes(2);

        // 100-dimensional hashed feature space
        Vector v = new RandomAccessSparseVector(100);
        type_encoder.addToVector(type, v);

        for (String word : data.getWords()) {
            content_encoder.addToVector(word, v);
        }
        return new NamedVector(v, label);

where data.getWords() is the massaged content of the various documents
(tidied, characters extracted, then run through the Lucene standard
analyzer and a lower-case filter).
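
Concretely, the analyzer part of that massaging is roughly the following
(a simplified, untested sketch assuming Lucene 3.6; the real getWords()
also does the tidying and character extraction first):

        // simplified sketch of the analyzer chain described above
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        TokenStream ts = analyzer.tokenStream("content", new StringReader(text));
        ts = new LowerCaseFilter(Version.LUCENE_36, ts);
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);

        List<String> words = new ArrayList<String>();
        ts.reset();
        while (ts.incrementToken()) {
            words.add(term.toString());
        }
        ts.end();
        ts.close();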

Training looks like:

            Configuration hconf = new Configuration();
            FileSystem fs = FileSystem.get(path, hconf);

            // vectors were written as LongWritable/VectorWritable pairs
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(path), hconf);
            LongWritable key = new LongWritable();
            VectorWritable value = new VectorWritable();

            // 2 categories, 100 features, L1 prior
            AdaptiveLogisticRegression reg = new AdaptiveLogisticRegression(2, 100, new L1());

            while (reader.next(key, value)) {
                NamedVector v = (NamedVector) value.get();
                System.out.println(v.getName());
                // the vector's name carries the target label
                reg.train("spam".equals(v.getName()) ? 1 : 0, v);
            }
            reader.close();
            reg.close();

            CrossFoldLearner best = reg.getBest().getPayload().getLearner();
            System.out.println(best.percentCorrect());
            System.out.println(best.auc());
            System.out.println(best.getLogLikelihood());

            ModelSerializer.writeBinary(model.getPath(), reg.getBest().getPayload().getLearner());

And running the same data back through the model looks like:

            InputStream in = new FileInputStream(model);
            CrossFoldLearner best = ModelSerializer.readBinary(in, CrossFoldLearner.class);
            in.close();

            Configuration hconf = new Configuration();
            FileSystem fs = FileSystem.get(path, hconf);

            SequenceFile.Reader reader = new SequenceFile.Reader(fs, new Path(path), hconf);
            LongWritable key = new LongWritable();
            VectorWritable value = new VectorWritable();

            int correct = 0;
            int total = 0;
            while (reader.next(key, value)) {
                total++;
                NamedVector v = (NamedVector) value.get();
                int expected = "spam".equals(v.getName()) ? 1 : 0;

                // classifyFull fills p with one score per category
                Vector p = new DenseVector(2);
                best.classifyFull(p, v);
                int cat = p.maxValueIndex();
                System.out.println(cat == 1 ? "SPAM" : "HAM");
                if (cat == expected) { correct++; }
            }
            reader.close();
            best.close();

            double cd = correct;
            double td = total;
            System.out.println(cd / td);

Can anyone help me figure out what I am doing wrong?

Also, I'd love to try naive bayes or complementary naive bayes, but I am
unable to find any documentation on how to do so :-(
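
From poking around the mahout source, I *think* the runtime side of naive
bayes would look roughly like the sketch below, but I'd love confirmation.
It's untested, and it assumes the model directory was produced by the
trainnb map-reduce job; the complementary variant would presumably just
swap in ComplementaryNaiveBayesClassifier:

            // untested sketch, assuming a model trained with something like
            //   mahout trainnb -i <vectors> -o <model dir> -el -li <label index>
            // (with -c for complementary naive bayes)
            Configuration conf = new Configuration();
            NaiveBayesModel nbModel = NaiveBayesModel.materialize(new Path(modelDir), conf);

            // StandardNaiveBayesClassifier for plain naive bayes;
            // ComplementaryNaiveBayesClassifier for the complementary variant
            AbstractNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(nbModel);

            // one score per category, highest wins
            Vector scores = classifier.classifyFull(instance);
            int category = scores.maxValueIndex();

Is that anywhere close to the intended usage?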
