Subject: Compare LogisticRegression results using MLlib with those using other libraries (e.g. statsmodels)
From: Xin Liu
To: user@spark.apache.org
Date: Wed, 20 May 2015 15:42:24 -0700

Hi,

I have tried a few models in MLlib to train a LogisticRegression model. However, I consistently get much better results, in terms of AUC, from other libraries such as statsmodels (which gives results similar to R's). For illustration I used a small dataset (I have also tried much bigger data), http://www.ats.ucla.edu/stat/data/binary.csv, described in http://www.ats.ucla.edu/stat/r/dae/logit.htm.

Here is the snippet of my usage of LogisticRegressionWithLBFGS:
    val algorithm = new LogisticRegressionWithLBFGS
    algorithm.setIntercept(true)
    algorithm.optimizer
      .setNumIterations(100)
      .setRegParam(0.01)
      .setConvergenceTol(1e-5)
    val model = algorithm.run(training)
    model.clearThreshold()
    val scoreAndLabels = test.map { point =>
      val score = model.predict(point.features)
      (score, point.label)
    }
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    val auROC = metrics.areaUnderROC()

I did a (0.6, 0.4) split for training/test. The response is "admit" and the features are "GRE score", "GPA", and "college rank".

Spark:
Weights (GRE, GPA, Rank): [0.0011576276331509304, 0.048544858567336854, -0.394202150286076]
Intercept: -0.6488972641282202
Area under ROC: 0.6294070512820512

statsmodels:
Weights (GRE, GPA, Rank): [0.0018, 0.7220, -0.3148]
Intercept: -3.5913
Area under ROC: 0.69

The weights from statsmodels seem more reasonable: for a one-unit increase in GPA, the log odds of being admitted to graduate school increase by 0.72 according to statsmodels, but only by 0.05 according to Spark. I have seen much bigger differences with other data.

So my question is: has anyone compared MLlib's results with those of other libraries, and is anything wrong with how my code invokes LogisticRegressionWithLBFGS? The real data I am processing is pretty big, and I really want to use Spark for it. Please let me know if you have had a similar experience and how you resolved it.

Thanks,
Xin
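P.S. One thing worth noting when comparing the two fits (an assumption on my part, not a diagnosis): the snippet sets setRegParam(0.01), which as far as I know applies L2 regularization in MLlib, while a plain statsmodels Logit fit is unregularized, and L2 regularization shrinks coefficients toward zero. Below is a minimal pure-Python sketch of that shrinkage effect, using hand-rolled gradient descent on a tiny hypothetical dataset (not MLlib, statsmodels, or the UCLA data):

```python
import math

def fit_logistic(X, y, lam=0.0, lr=0.1, iters=5000):
    """Logistic regression by batch gradient descent.
    lam is an L2 penalty on the weights (intercept left unpenalized)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iters):
        gw = [lam * wj for wj in w]  # gradient contribution of the L2 term
        gb = 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
            for j in range(d):
                gw[j] += (p - yi) * xi[j] / n
            gb += (p - yi) / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

# Toy, non-separable single-feature data (hypothetical, for illustration only)
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [0, 0, 1, 0, 1, 1]

w_unreg, _ = fit_logistic(X, y, lam=0.0)  # analogous to an unpenalized fit
w_reg, _ = fit_logistic(X, y, lam=0.5)    # analogous to a nonzero regParam
print(w_unreg[0], w_reg[0])               # the penalized weight is smaller

# Odds-ratio reading of the GPA coefficients reported above:
# exp(0.7220) ~ 2.06 (statsmodels) vs exp(0.0485) ~ 1.05 (Spark)
```

If the discrepancy is mostly regularization, the two libraries should agree much more closely with regParam set to 0; but that is a guess, not something I have verified on this data.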