mahout-user mailing list archives

From Emre Sevinc
Subject Why doesn't Mahout logistic regression give a good AUC when the model is tested on training data?
Date Sun, 18 Jan 2015 15:39:16 GMT

I'm using Mahout's logistic regression (version 0.9), but when I evaluate
the trained model on the same data set it was trained on, I do not get a
high AUC. I would expect the AUC to be very high, since it is the same
data set.

My data set is a CSV file with about 7 million lines and has 18 attributes,
some numerical and some categorical.

This is how I create the model for logistic regression (I ignore some of
the attributes):

$ mahout trainlogistic --input train.csv \
--output ./model \
--categories 2 \
--predictors attribute1 ... attribute15 \
--types w w w n n w w w w w w w n n n \
--target is_delayed \
--rate 100 \
--passes 2 \
--features 500000

Then I check the AUC using the model on the same data set:

$ mahout runlogistic --input train.csv --model ./model --auc --confusion
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/lib/hadoop/bin/hadoop and
MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.9-cdh5.3.0-job.jar
AUC = 0.48
confusion: [[1703477.0, 761921.0], [3034369.0, 1137161.0]]
entropy: [[NaN, NaN], [-16.5, -17.4]]
15/01/18 06:50:50 INFO driver.MahoutDriver: Program took 98213 ms (Minutes:
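
As a quick sanity check on the output above, the accuracy implied by the
confusion matrix can be worked out by hand. The following is a minimal
sketch in plain Python, not part of Mahout. It assumes the usual convention
that the diagonal entries of the matrix are the correctly classified rows;
the variable names are only illustrative.

```python
# Confusion matrix copied from the runlogistic output above.
confusion = [[1703477.0, 761921.0], [3034369.0, 1137161.0]]

# Total number of rows that were scored.
total = sum(sum(row) for row in confusion)

# Assumption: diagonal entries are the correctly classified rows.
correct = confusion[0][0] + confusion[1][1]
accuracy = correct / total

print(f"rows scored: {total:.0f}")
print(f"accuracy:    {accuracy:.3f}")
```

Under that assumption the accuracy is also below 0.5, which is consistent
with the low AUC rather than contradicting it.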

I'm really confused about why I only get AUC = 0.48 instead of 1.00 or
something very close to it, since it is the same data set.

Am I missing something? What should I check first?

I also tried with only a few attributes, but the AUC was still very low,
around 0.47. That means the model is doing no better than random guessing,
or even slightly worse, right?
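
To make the "worse than random" reading concrete, here is a minimal sketch
in plain Python of the standard pairwise definition of AUC: the fraction of
(positive, negative) pairs in which the positive row gets the higher score,
with ties counted as one half. The function name `auc` and the toy labels
and scores are hypothetical and are not taken from the data set above; this
is not necessarily how Mahout computes its value.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical toy data, chosen only for illustration.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.6, 0.4, 0.5, 0.3, 0.7, 0.8]

original = auc(labels, scores)
# Negating every score reverses the ranking of the rows.
reversed_auc = auc(labels, [-s for s in scores])

print(f"AUC:                   {original:.3f}")
print(f"AUC with scores negated: {reversed_auc:.3f}")
```

Reversing the ranking turns an AUC of a into 1 - a, so a value slightly
below 0.5 corresponds to a ranking that is very slightly inverted, which in
practice is indistinguishable from random ordering.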

