From: Nick Pentreath
Date: Fri, 18 Mar 2016 11:15:08 +0000
Subject: Re: Spark ML - Scaling logistic regression for many features
To: Daniel Siegmann
Cc: dev@spark.apache.org

No, I didn't yet - feel free to create a JIRA.

On Thu, 17 Mar 2016 at 22:55 Daniel Siegmann <daniel.siegmann@teamaol.com> wrote:

> Hi Nick,
>
> Thanks again for your help with this. Did you create a ticket in JIRA for investigating sparse models in LR and/or the multivariate summariser? If so, can you give me the issue key(s)? If not, would you like me to create these tickets?
>
> I'm going to look into this some more and see if I can figure out how to implement these fixes.
>
> ~Daniel Siegmann
>
> On Sat, Mar 12, 2016 at 5:53 AM, Nick Pentreath <nick.pentreath@gmail.com> wrote:
>
>> Also adding the dev list in case anyone else has ideas / views.
>>
>> On Sat, 12 Mar 2016 at 12:52, Nick Pentreath <nick.pentreath@gmail.com> wrote:
>>
>>> Thanks for the feedback.
>>>
>>> I think Spark can certainly meet your use case when your data size scales up, as the actual model dimension is very small - you will need to use those indexers or some other mapping mechanism.
>>>
>>> There is ongoing work for Spark 2.0 to make it easier to use models outside of Spark - also see PMML export (I think mllib logistic regression is supported, but I have to check that). That will help with using Spark models in serving environments.
>>>
>>> Finally, I will add a JIRA to investigate sparse models for LR - maybe also a ticket for the multivariate summariser (though I don't think there will be much to gain in practice).
>>>
>>> On Fri, 11 Mar 2016 at 21:35, Daniel Siegmann <daniel.siegmann@teamaol.com> wrote:
>>>
>>>> Thanks for the pointer to those indexers - those are some good examples. A good way to go for the trainer and any scoring done in Spark. I will definitely have to deal with scoring in non-Spark systems though.
>>>>
>>>> I think I will need to scale up beyond what single-node liblinear can practically provide. The system will need to handle much larger sub-samples of this data (and other projects might be larger still). Additionally, the system needs to train many models in parallel (hyper-parameter optimization with n-fold cross-validation, multiple algorithms, different sets of features).
>>>>
>>>> Still, I suppose we'll have to consider whether Spark is the best system for this. For now though, my job is to see what can be achieved with Spark.
>>>>
>>>> On Fri, Mar 11, 2016 at 12:45 PM, Nick Pentreath <nick.pentreath@gmail.com> wrote:
>>>>
>>>>> Ok, I think I understand things better now.
>>>>>
>>>>> For Spark's current implementation, you would need to map those features as you mention. You could also use, say, StringIndexer -> OneHotEncoder or VectorIndexer. You could create a Pipeline to deal with the mapping and training (e.g. http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline). Pipeline supports persistence.
>>>>>
>>>>> But it depends on your scoring use case too - a Spark pipeline can be saved and then reloaded, but you need all of Spark's dependencies in your serving app, which is often not ideal. If you're doing bulk scoring offline, then it may suit.
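
For illustration, here is a minimal sketch of the indexer/encoder pipeline approach described above, using the DataFrame-based ml API. The column names, path, and hyper-parameters are made up for the example rather than taken from this thread:

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.DataFrame

// training is assumed to have a categorical "thingId" column and a numeric "label" column.
def fitAndSave(training: DataFrame, modelPath: String): PipelineModel = {
  // Map each raw ID to a small contiguous index, then expand it to a 0/1 feature vector.
  val indexer = new StringIndexer().setInputCol("thingId").setOutputCol("thingIndex")
  val encoder = new OneHotEncoder().setInputCol("thingIndex").setOutputCol("features")

  // LogisticRegression reads the default "features" and "label" columns.
  val lr = new LogisticRegression().setMaxIter(100).setRegParam(0.01)

  // The pipeline learns the ID -> index mapping and the coefficients together.
  val model = new Pipeline().setStages(Array(indexer, encoder, lr)).fit(training)

  // Persist the fitted pipeline (mapping included) so scoring jobs can reload it.
  model.write.overwrite().save(modelPath)
  model
}

A scoring job could then call PipelineModel.load(modelPath).transform(newData), provided it is willing to carry the Spark dependencies, as noted above.
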
>>>>> Honestly though, for that data size I'd certainly go with something like Liblinear :) Spark will ultimately scale better with the number of training examples for very large scale problems. However, there are definitely limitations on model dimension and sparse weight vectors currently. There are potential solutions to these, but they haven't been implemented as yet.
>>>>>
>>>>> On Fri, 11 Mar 2016 at 18:35 Daniel Siegmann <daniel.siegmann@teamaol.com> wrote:
>>>>>
>>>>>> On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath <nick.pentreath@gmail.com> wrote:
>>>>>>
>>>>>>> Would you mind letting us know the number of training examples in the datasets? Also, what do your features look like? Are they text, categorical, etc.? You mention that most rows only have a few features, and all rows together have a few 10,000s of features, yet your max feature value is 20 million. How are you constructing your feature vectors to get a size of 20 million? The only realistic way I can see this situation occurring in practice is with feature hashing (HashingTF).
>>>>>>
>>>>>> The sub-sample I'm currently training on is about 50K rows, so ... small.
>>>>>>
>>>>>> The features causing this issue are numeric (int) IDs for ... let's call it "Thing". For each Thing in the record, we set the feature Thing.id to a value of 1.0 in our vector (which is of course a SparseVector). I'm not sure how IDs are generated for Things, but they can be large numbers.
>>>>>>
>>>>>> The largest Thing ID is around 20 million, so that ends up being the size of the vector. But in fact there are fewer than 10,000 unique Thing IDs in this data. The mean number of features per record in what I'm currently training against is 41, while the maximum for any given record was 1754.
>>>>>>
>>>>>> It is possible to map the features into a small set (just need to zipWithIndex), but this is undesirable because of the added complexity (not just for the training, but also for anything wanting to score against the model). It might be a little easier if this could be encapsulated within the model object itself (perhaps via composition), though I'm not sure how feasible that is.
>>>>>>
>>>>>> But I'd rather not bother with dimensionality reduction at all - since we can train using liblinear in just a few minutes, it doesn't seem necessary.
>>>>>>
>>>>>>> MultivariateOnlineSummarizer uses dense arrays, but it should be possible to enable sparse data. Though in theory the result will tend to be dense anyway, unless you have very many entries in the input feature vector that never occur and are actually zero throughout the data set (which it seems is the case with your data?). So I doubt whether using sparse vectors for the summarizer would improve performance in general.
>>>>>>
>>>>>> Yes, that is exactly my case - the vast majority of entries in the input feature vector will *never* occur. Presumably that means most of the values in the aggregators' arrays will be zero.
>>>>>>
>>>>>>> LR doesn't accept a sparse weight vector, as it currently uses dense vectors for coefficients and gradients. When using L1 regularization it could support sparse weight vectors, but the current implementation doesn't do that yet.
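
For concreteness, a sketch of the zipWithIndex remapping mentioned above - assigning each distinct Thing ID a compact index so the vectors are under 10,000 wide instead of ~20 million. The input shape and names here are hypothetical:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical input: one (label, Thing IDs present in the record) pair per record.
def remapFeatures(records: RDD[(Double, Seq[Long])]): RDD[LabeledPoint] = {
  // The zipWithIndex step: give each distinct Thing ID a small contiguous index.
  val idToIndex: Map[Long, Int] =
    records.flatMap(_._2)
      .distinct()
      .zipWithIndex()
      .map { case (id, idx) => (id, idx.toInt) }
      .collectAsMap()
      .toMap

  // Vector size is now the number of distinct IDs, not the largest raw ID.
  val numFeatures = idToIndex.size

  records.map { case (label, thingIds) =>
    val indices = thingIds.map(idToIndex).distinct.sorted.toArray
    LabeledPoint(label, Vectors.sparse(numFeatures, indices, Array.fill(indices.length)(1.0)))
  }
}

The idToIndex map would have to be saved alongside the model and applied by anything scoring against it - which is exactly the added complexity objected to above. The resulting RDD[LabeledPoint] can be converted to a DataFrame for the ml LogisticRegression.
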
>>>>>> Good to know it is theoretically possible to implement. I'll have to give it some thought. In the meantime I guess I'll experiment with coalescing the data to minimize the communication overhead.
>>>>>>
>>>>>> Thanks again.
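
As a footnote on that last point, the coalescing experiment amounts to reducing the partition count before fitting; the count used here is illustrative and would need tuning to the cluster:

import org.apache.spark.sql.DataFrame

// Fewer partitions means fewer partial aggregates to combine in each treeAggregate
// round of LR training, at the cost of less parallelism per iteration.
def coalesceForTraining(training: DataFrame, numPartitions: Int = 16): DataFrame =
  training.coalesce(numPartitions)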