Subject: Compare LogisticRegression results using MLlib with those using other libraries (e.g. statsmodels)
From: Xin Liu
To: user@spark.apache.org
Date: Wed, 20 May 2015 15:42:24 -0700

Hi,

I have tried a few models in MLlib to train a LogisticRegression model. However, I consistently get much better results, in terms of AUC, from other libraries such as statsmodels (which gives results similar to R's). For illustration I used a small dataset (I have also tried much bigger data), http://www.ats.ucla.edu/stat/data/binary.csv, described in http://www.ats.ucla.edu/stat/r/dae/logit.htm.

Here is the snippet of my usage of LogisticRegressionWithLBFGS:
    val algorithm = new LogisticRegressionWithLBFGS
    algorithm.setIntercept(true)
    algorithm.optimizer
      .setNumIterations(100)
      .setRegParam(0.01)
      .setConvergenceTol(1e-5)
    val model = algorithm.run(training)
    model.clearThreshold()
    val scoreAndLabels = test.map { point =>
      val score = model.predict(point.features)
      (score, point.label)
    }
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    val auROC = metrics.areaUnderROC()

I did a (0.6, 0.4) split for training/test. The response is "admit" and the features are "GRE score", "GPA", and "college rank".

Spark:
Weights (GRE, GPA, Rank): [0.0011576276331509304, 0.048544858567336854, -0.394202150286076]
Intercept: -0.6488972641282202
Area under ROC: 0.6294070512820512

statsmodels:
Weights (GRE, GPA, Rank): [0.0018, 0.7220, -0.3148]
Intercept: -3.5913
Area under ROC: 0.69

The weights from statsmodels seem more reasonable: for a one-unit increase in GPA, the log odds of being admitted to graduate school increase by 0.72 according to statsmodels, but only by 0.05 according to Spark. I have seen much bigger differences with other data.

So my question is: has anyone compared MLlib's results with those of other libraries, and is anything wrong with how my code invokes LogisticRegressionWithLBFGS? The real data I am processing is pretty big, and I really want to use Spark for it. Please let me know if you have had a similar experience and how you resolved it.

Thanks,
Xin
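P.S. One thing worth noting when comparing the two fits (an assumption on my part, not a diagnosis): the snippet sets setRegParam(0.01), which as far as I know applies L2 regularization in MLlib, while a plain statsmodels Logit fit is unregularized, and L2 regularization shrinks coefficients toward zero. Below is a minimal pure-Python sketch of that shrinkage effect, using hand-rolled gradient descent on a tiny hypothetical dataset (not MLlib, statsmodels, or the UCLA data):

```python
import math

def fit_logistic(X, y, lam=0.0, lr=0.1, iters=5000):
    """Logistic regression by batch gradient descent.
    lam is an L2 penalty on the weights (intercept left unpenalized)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        gw = [lam * wj for wj in w]  # gradient contribution of the L2 term
        gb = 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
            for j in range(d):
                gw[j] += (p - yi) * xi[j] / n
            gb += (p - yi) / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

# Toy, non-separable single-feature data (hypothetical, for illustration only)
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [0, 0, 1, 0, 1, 1]

w_unreg, _ = fit_logistic(X, y, lam=0.0)  # analogous to an unpenalized fit
w_reg, _ = fit_logistic(X, y, lam=0.5)    # analogous to a nonzero regParam
print(w_unreg[0], w_reg[0])               # the penalized weight is smaller

# Odds-ratio reading of the GPA coefficients reported above:
# exp(0.7220) ~ 2.06 (statsmodels) vs exp(0.0485) ~ 1.05 (Spark)
```

If the discrepancy is mostly regularization, the two libraries should agree much more closely with regParam set to 0; but that is a guess, not something I have verified on this data.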