Hi,
I have tried a few approaches in MLlib to train a logistic regression model.
However, I consistently get much better results, in terms of AUC, from other
libraries such as statsmodels (which gives results similar to R's). For
illustration purposes I used a small dataset (I have also tried much larger
data), http://www.ats.ucla.edu/stat/data/binary.csv, which is described in
http://www.ats.ucla.edu/stat/r/dae/logit.htm
Here is a snippet of my usage of LogisticRegressionWithLBFGS:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

val algorithm = new LogisticRegressionWithLBFGS()
algorithm.setIntercept(true)
algorithm.optimizer
  .setNumIterations(100)
  .setRegParam(0.01)
  .setConvergenceTol(1e5)

val model = algorithm.run(training)
model.clearThreshold()  // return raw scores rather than 0/1 predictions

val scoreAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
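For reference, the AUC that BinaryClassificationMetrics reports is equivalent to the probability that a randomly chosen positive example scores higher than a randomly chosen negative one (ties counting one half). A minimal pure-Scala sketch of that statistic, just to make the metric concrete (this is a hypothetical helper, not the MLlib implementation, which works on distributed data):

```scala
// AUC via the Mann-Whitney statistic: fraction of (positive, negative)
// pairs where the positive example gets the higher score (ties count 1/2).
object Auc {
  def auc(scoreAndLabels: Seq[(Double, Double)]): Double = {
    val pos = scoreAndLabels.collect { case (s, 1.0) => s }
    val neg = scoreAndLabels.collect { case (s, 0.0) => s }
    val wins = for (p <- pos; n <- neg)
      yield (if (p > n) 1.0 else if (p == n) 0.5 else 0.0)
    wins.sum / (pos.size.toDouble * neg.size)
  }

  def main(args: Array[String]): Unit = {
    // Two positives (labels 1.0) and two negatives (labels 0.0):
    val sample = Seq((0.9, 1.0), (0.8, 0.0), (0.7, 1.0), (0.3, 0.0))
    println(Auc.auc(sample))  // 3 of 4 pairs ranked correctly -> 0.75
  }
}
```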
I did a (0.6, 0.4) training/test split. The response is "admit" and the
features are "GRE score", "GPA", and "college rank".
Spark:
  Weights (GRE, GPA, Rank): [0.0011576276331509304, 0.048544858567336854, 0.394202150286076]
  Intercept: 0.6488972641282202
  Area under ROC: 0.6294070512820512

statsmodels:
  Weights (GRE, GPA, Rank): [0.0018, 0.7220, 0.3148]
  Intercept: 3.5913
  Area under ROC: 0.69
The weights from statsmodels seem more reasonable: for a one-unit increase
in GPA, the log odds of being admitted to graduate school increase by 0.72
according to statsmodels, versus only about 0.05 according to Spark.
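To make the log-odds interpretation concrete: exponentiating a logistic-regression coefficient gives the multiplicative change in the odds per one-unit increase in that feature. A small sketch using the GPA weights reported above (values rounded from the two fits):

```scala
object OddsRatios {
  def main(args: Array[String]): Unit = {
    val gpaSpark = 0.0485  // GPA weight from the Spark fit above (rounded)
    val gpaStats = 0.7220  // GPA weight from the statsmodels fit above
    // exp(coefficient) = factor by which the odds of admission change
    // for each additional GPA point, holding the other features fixed.
    println(f"Spark:       odds multiply by ${math.exp(gpaSpark)}%.3f per GPA point")
    println(f"statsmodels: odds multiply by ${math.exp(gpaStats)}%.3f per GPA point")
  }
}
```

So the statsmodels fit implies roughly doubled odds per GPA point, while the Spark fit implies an almost negligible change, which is why the gap looks suspicious.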
I have seen much bigger differences with other data. So my questions are:
has anyone compared MLlib's results with those of other libraries, and is
there anything wrong with my code invoking LogisticRegressionWithLBFGS?
The real data I am processing is pretty big, and I really want to use Spark
to get this to work. Please let me know if you have had a similar experience
and how you resolved it.
Thanks,
Xin
