Date: Sat, 13 Feb 2016 17:22:18 +0000 (UTC)
From: "Joel Bernstein (JIRA)"
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Subject: [jira] [Commented] (SOLR-8492) Add LogisticRegressionQuery and LogitStream

    [ https://issues.apache.org/jira/browse/SOLR-8492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146074#comment-15146074 ]

Joel Bernstein commented on SOLR-8492:
--------------------------------------

[~caomanhdat], I've been testing out the algorithm by comparing it to R. I've attached the data set I'm using to the ticket. I'm attempting to make sense of the results, but they don't seem to match up.

I ran the following LogitStream call:

{code}
logit(collection3, q="*:*", features="married_i", outcome="buy_i", maxIterations=100)
{code}

And got this for the final weights:

{code}
{"features":["married_i"],"weights":[-0.5048080969966502],"error":0.3808560947044004}
{code}

In R, here is the output for the same data set. *Note* that *Is.Married* is the same field as *married_i* in Solr and same for *Buy* and *buy_i*

{code}
Call:
glm(formula = Buy ~ Is.Married, family = "binomial", data = mydata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-0.9687  -0.4201  -0.4201  -0.4201   2.2232

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.3830     0.1718 -13.870   <2e-16 ***
Is.Married    1.8699     0.2184   8.563   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 646.05  on 672  degrees of freedom
Residual deviance: 564.47  on 671  degrees of freedom
AIC: 568.47

Number of Fisher Scoring iterations: 5
{code}

It appears that they are giving different results. Am I performing the test correctly?

> Add LogisticRegressionQuery and LogitStream
> -------------------------------------------
>
>                 Key: SOLR-8492
>                 URL: https://issues.apache.org/jira/browse/SOLR-8492
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Joel Bernstein
>         Attachments: SOLR-8492.patch, SOLR-8492.patch, SOLR-8492.patch, SOLR-8492.patch, SOLR-8492.patch, SOLR-8492.patch
>
>
> This ticket is to add a new query called a LogisticRegressionQuery (LRQ).
> The LRQ extends AnalyticsQuery (http://joelsolr.blogspot.com/2015/12/understanding-solrs-analyticsquery.html) and returns a DelegatingCollector that implements a Stochastic Gradient Descent (SGD) optimizer for Logistic Regression.
> This ticket also adds the LogitStream which leverages Streaming Expressions to provide iteration over the shards.
> Each call to LogitStream.read() calls down to the shards and executes the LogisticRegressionQuery. The model data is collected from the shards and the weights are averaged and sent back to the shards with the next iteration. Each call to read() returns a Tuple with the averaged weights and error from the shards. With this approach the LogitStream streams the changing model back to the client after each iteration.
> The LogitStream will return the EOF Tuple when it reaches the defined maxIterations. When sent as a Streaming Expression to the Stream handler this provides parallel iterative behavior. This same approach can be used to implement other parallel iterative algorithms.
> The initial patch has a test which simply tests the mechanics of the iteration. More work will need to be done to ensure the SGD is properly implemented. The distributed approach of the SGD will also need to be reviewed.
> This implementation is designed for use cases with a small number of features because each feature is its own discrete field.
> An implementation which supports a higher number of features would be possible by packing features into a byte array and storing as binary DocValues.
> This implementation is designed to support a large sample set. With a large number of shards, a sample set into the billions may be possible.
> Sample Streaming Expression syntax:
> {code}
> logit(collection1, features="a,b,c,d,e,f", outcome="x", maxIterations="80")
> {code}
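
The iterate-and-average scheme described in the ticket can be illustrated with a minimal Python sketch. This is not the code from the patch: the function names (sigmoid, sgd_pass, train), the learning rate, the use of mean absolute error as the reported error, and the in-memory "shards" are all assumptions made for illustration. Each shard runs one SGD pass starting from the current shared weights, the per-shard weights and errors are averaged, and the averaged model is yielded after every iteration until maxIterations is reached.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def sgd_pass(weights, rows, learning_rate):
    """Run one SGD pass over a single shard (hypothetical helper).

    rows is a non-empty list of (features, outcome) pairs with outcome 0 or 1.
    Returns the shard's updated weights and its mean absolute error.
    """
    w = list(weights)
    total_error = 0.0
    for features, outcome in rows:
        prediction = sigmoid(sum(wi * xi for wi, xi in zip(w, features)))
        residual = outcome - prediction
        total_error += abs(residual)
        # Gradient step for the log-likelihood of logistic regression.
        for i, xi in enumerate(features):
            w[i] += learning_rate * residual * xi
    return w, total_error / len(rows)


def train(shards, n_features, max_iterations, learning_rate=0.1):
    """Yield (averaged_weights, averaged_error) once per iteration."""
    weights = [0.0] * n_features
    for _ in range(max_iterations):
        # Every shard starts the iteration from the same averaged model.
        results = [sgd_pass(weights, rows, learning_rate) for rows in shards]
        weights = [
            sum(shard_w[i] for shard_w, _ in results) / len(results)
            for i in range(n_features)
        ]
        error = sum(shard_err for _, shard_err in results) / len(results)
        yield list(weights), error


if __name__ == "__main__":
    # Made-up data: a constant-1 column plus one binary feature.
    rng = random.Random(42)
    rows = [([1.0, 0.0], 0.0)] * 40 + [([1.0, 1.0], 1.0)] * 15 + [([1.0, 1.0], 0.0)] * 25
    shards = []
    for _ in range(3):
        shard = list(rows)
        rng.shuffle(shard)
        shards.append(shard)
    for weights, error in train(shards, n_features=2, max_iterations=5):
        print(weights, error)
```

Note that this sketch has no separate intercept term. To fit a model of the same form as R's glm output, which reports an (Intercept) coefficient, a constant-1 feature has to be supplied explicitly, as the demo data does.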