Date: Thu, 28 Apr 2016 15:06:42 -0400
Subject: Re: Spark ML - Scaling logistic regression for many features
From: Daniel Siegmann
To: Nick Pentreath
Cc: dev@spark.apache.org

FYI: https://issues.apache.org/jira/browse/SPARK-14464

I have submitted a PR as well.

On Fri, Mar 18, 2016 at 7:15 AM, Nick Pentreath wrote:

> No, I didn't yet - feel free to create a JIRA.
>
> On Thu, 17 Mar 2016 at 22:55 Daniel Siegmann wrote:
>
>> Hi Nick,
>>
>> Thanks again for your help with this. Did you create a ticket in JIRA for investigating sparse models in LR and/or the multivariate summariser? If so, can you give me the issue key(s)? If not, would you like me to create these tickets?
>>
>> I'm going to look into this some more and see if I can figure out how to implement these fixes.
>>
>> ~Daniel Siegmann
>>
>> On Sat, Mar 12, 2016 at 5:53 AM, Nick Pentreath wrote:
>>
>>> Also adding the dev list in case anyone else has ideas / views.
>>>
>>> On Sat, 12 Mar 2016 at 12:52, Nick Pentreath wrote:
>>>
>>>> Thanks for the feedback.
>>>>
>>>> I think Spark can certainly meet your use case when your data size scales up, as the actual model dimension is very small - you will need to use those indexers or some other mapping mechanism.
>>>>
>>>> There is ongoing work for Spark 2.0 to make it easier to use models outside of Spark - also see PMML export (I think MLlib logistic regression is supported, but I have to check that). That will help with using Spark models in serving environments.
>>>>
>>>> Finally, I will add a JIRA to investigate sparse models for LR - maybe also a ticket for the multivariate summariser (though I don't think there will be much to gain in practice).
>>>>
>>>> On Fri, 11 Mar 2016 at 21:35, Daniel Siegmann <daniel.siegmann@teamaol.com> wrote:
>>>>
>>>>> Thanks for the pointer to those indexers - those are some good examples, and a good way to go for the trainer and any scoring done in Spark. I will definitely have to deal with scoring in non-Spark systems, though.
>>>>>
>>>>> I think I will need to scale up beyond what single-node liblinear can practically provide. The system will need to handle much larger sub-samples of this data (and other projects might be larger still). Additionally, the system needs to train many models in parallel (hyper-parameter optimization with n-fold cross-validation, multiple algorithms, different sets of features).
>>>>>
>>>>> Still, I suppose we'll have to consider whether Spark is the best system for this. For now though, my job is to see what can be achieved with Spark.
>>>>>
>>>>> On Fri, Mar 11, 2016 at 12:45 PM, Nick Pentreath <nick.pentreath@gmail.com> wrote:
>>>>>
>>>>>> OK, I think I understand things better now.
>>>>>>
>>>>>> For Spark's current implementation, you would need to map those features as you mention. You could also use, say, StringIndexer -> OneHotEncoder, or VectorIndexer. You could create a Pipeline to deal with the mapping and training (e.g. http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline). Pipeline supports persistence.
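For illustration, a minimal sketch of the kind of pipeline described above, against the Spark 1.6-era spark.ml API; the column names ("thingId", "label") and the trainingDF DataFrame are hypothetical, not from this thread:

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    // Map the raw categorical IDs down to a compact index space, one-hot
    // encode them, and train logistic regression on the resulting vectors.
    val indexer = new StringIndexer()
      .setInputCol("thingId")       // hypothetical raw ID column
      .setOutputCol("thingIndex")

    val encoder = new OneHotEncoder()
      .setInputCol("thingIndex")
      .setOutputCol("features")

    val lr = new LogisticRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")

    val pipeline = new Pipeline().setStages(Array(indexer, encoder, lr))

    // The fitted PipelineModel carries the ID -> index mapping along with
    // the model, and can be persisted and reloaded for scoring in Spark
    // (exact persistence coverage depends on the Spark version).
    val model = pipeline.fit(trainingDF)
    model.save("/tmp/thing-lr-pipeline")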
>>>>>>
>>>>>> But it depends on your scoring use case too - a Spark pipeline can be saved and then reloaded, but you need all of Spark's dependencies in your serving app, which is often not ideal. If you're doing bulk scoring offline, then it may suit.
>>>>>>
>>>>>> Honestly though, for that data size I'd certainly go with something like liblinear :) Spark will ultimately scale better with the number of training examples for very large-scale problems. However, there are definitely limitations on model dimension and sparse weight vectors currently. There are potential solutions to these, but they haven't been implemented yet.
>>>>>>
>>>>>> On Fri, 11 Mar 2016 at 18:35 Daniel Siegmann <daniel.siegmann@teamaol.com> wrote:
>>>>>>
>>>>>>> On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath <nick.pentreath@gmail.com> wrote:
>>>>>>>
>>>>>>>> Would you mind letting us know the number of training examples in the datasets? Also, what do your features look like? Are they text, categorical, etc.? You mention that most rows only have a few features, and all rows together have a few tens of thousands of features, yet your maximum feature index is 20 million. How are you constructing your feature vectors to get a size of 20 million? The only realistic way I can see this situation occurring in practice is with feature hashing (HashingTF).
>>>>>>>
>>>>>>> The sub-sample I'm currently training on is about 50K rows, so ... small.
>>>>>>>
>>>>>>> The features causing this issue are numeric (int) IDs for ... let's call it "Thing". For each Thing in the record, we set the feature Thing.id to a value of 1.0 in our vector (which is of course a SparseVector). I'm not sure how IDs are generated for Things, but they can be large numbers.
>>>>>>>
>>>>>>> The largest Thing ID is around 20 million, so that ends up being the size of the vector. But in fact there are fewer than 10,000 unique Thing IDs in this data. The mean number of features per record in what I'm currently training against is 41, while the maximum for any given record was 1754.
>>>>>>>
>>>>>>> It is possible to map the features into a small set (we'd just need to zipWithIndex), but this is undesirable because of the added complexity (not just for the training, but also for anything wanting to score against the model). It might be a little easier if this could be encapsulated within the model object itself (perhaps via composition), though I'm not sure how feasible that is.
>>>>>>>
>>>>>>> But I'd rather not bother with dimensionality reduction at all - since we can train using liblinear in just a few minutes, it doesn't seem necessary.
>>>>>>>
>>>>>>>> MultivariateOnlineSummarizer uses dense arrays, but it should be possible to enable sparse data. Though in theory the result will tend to be dense anyway, unless you have very many entries in the input feature vector that never occur and are actually zero throughout the data set (which it seems is the case with your data?). So I doubt whether using sparse vectors for the summarizer would improve performance in general.
>>>>>>>
>>>>>>> Yes, that is exactly my case - the vast majority of entries in the input feature vector will *never* occur. Presumably that means most of the values in the aggregators' arrays will be zero.
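As a rough, hypothetical illustration of the mismatch being discussed (the IDs below are made up, not from this thread): the declared vector size is driven by the largest raw ID, while only a handful of positions are ever non-zero, so any dense buffer sized to the full dimension is almost entirely zeros.

    import org.apache.spark.mllib.linalg.Vectors

    // One record's features: a few active Thing IDs, each with value 1.0.
    // The vector size must cover the largest possible raw ID (~20 million),
    // even though fewer than 10,000 distinct IDs ever appear in the data.
    val activeThingIds = Array(17, 80342, 19999873)   // hypothetical IDs, sorted ascending
    val vectorSize = 20000000
    val features = Vectors.sparse(vectorSize, activeThingIds,
      Array.fill(activeThingIds.length)(1.0))

    // Any dense structure sized to vectorSize - e.g. the summarizer's
    // aggregation buffers or LR's coefficient vector - allocates ~20 million
    // doubles (~160 MB) per buffer, nearly all of which stay at zero here.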
>>>>>>>
>>>>>>>> LR doesn't accept a sparse weight vector, as it uses dense vectors for coefficients and gradients currently. When using L1 regularization, it could support sparse weight vectors, but the current implementation doesn't do that yet.
>>>>>>>
>>>>>>> Good to know it is theoretically possible to implement. I'll have to give it some thought. In the meantime I guess I'll experiment with coalescing the data to minimize the communication overhead.
>>>>>>>
>>>>>>> Thanks again.
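For what it's worth, a minimal sketch of the coalescing idea mentioned above (the partition count of 16 and the trainingDF / lr names are hypothetical; whether this actually helps depends on the cluster and data size):

    // Fewer partitions means fewer (large, dense) partial gradient buffers
    // shipped around during each aggregation round of LR's optimizer.
    val coalesced = trainingDF.coalesce(16).cache()
    val lrModel = lr.fit(coalesced)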