From: Nick Pentreath
Date: Fri, 18 Mar 2016 11:15:08 +0000
Subject: Re: Spark ML - Scaling logistic regression for many features
To: Daniel Siegmann
Cc: dev@spark.apache.org

No, I didn't yet - feel free to create a JIRA.

On Thu, 17 Mar 2016 at 22:55 Daniel Siegmann <daniel.siegmann@teamaol.com> wrote:

> Hi Nick,
>
> Thanks again for your help with this. Did you create a ticket in JIRA for investigating sparse models in LR and/or the multivariate summariser? If so, can you give me the issue key(s)? If not, would you like me to create these tickets?
>
> I'm going to look into this some more and see if I can figure out how to implement these fixes.
>
> ~Daniel Siegmann
>
> On Sat, Mar 12, 2016 at 5:53 AM, Nick Pentreath <nick.pentreath@gmail.com> wrote:
>
>> Also adding the dev list in case anyone else has ideas / views.
>>
>> On Sat, 12 Mar 2016 at 12:52, Nick Pentreath <nick.pentreath@gmail.com> wrote:
>>
>>> Thanks for the feedback.
>>>
>>> I think Spark can certainly meet your use case when your data size scales up, as the actual model dimension is very small - you will need to use those indexers or some other mapping mechanism.
>>>
>>> There is ongoing work for Spark 2.0 to make it easier to use models outside of Spark - also see PMML export (I think mllib logistic regression is supported, but I have to check that). That will help with using Spark models in serving environments.
>>>
>>> Finally, I will add a JIRA to investigate sparse models for LR - maybe also a ticket for the multivariate summariser (though I don't think there will be much to gain in practice).
>>>
>>> On Fri, 11 Mar 2016 at 21:35, Daniel Siegmann <daniel.siegmann@teamaol.com> wrote:
>>>
>>>> Thanks for the pointer to those indexers - those are some good examples. A good way to go for the trainer and any scoring done in Spark. I will definitely have to deal with scoring in non-Spark systems though.
>>>>
>>>> I think I will need to scale up beyond what single-node liblinear can practically provide. The system will need to handle much larger sub-samples of this data (and other projects might be larger still). Additionally, the system needs to train many models in parallel (hyper-parameter optimization with n-fold cross-validation, multiple algorithms, different sets of features).
>>>>
>>>> Still, I suppose we'll have to consider whether Spark is the best system for this. For now though, my job is to see what can be achieved with Spark.
>>>>
>>>> On Fri, Mar 11, 2016 at 12:45 PM, Nick Pentreath <nick.pentreath@gmail.com> wrote:
>>>>
>>>>> Ok, I think I understand things better now.
>>>>>
>>>>> For Spark's current implementation, you would need to map those features as you mention. You could also use, say, StringIndexer -> OneHotEncoder or VectorIndexer. You could create a Pipeline to deal with the mapping and training (e.g. http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline). Pipeline supports persistence.
>>>>>
>>>>> But it depends on your scoring use case too - a Spark pipeline can be saved and then reloaded, but you need all of Spark's dependencies in your serving app, which is often not ideal. If you're doing bulk scoring offline, then it may suit.
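
For illustration, here is a minimal sketch of the indexer/encoder pipeline approach described above, using the DataFrame-based ml API. The column names, path, and hyper-parameters are made up for the example rather than taken from this thread:

import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
import org.apache.spark.sql.DataFrame

// training is assumed to have a categorical "thingId" column and a numeric "label" column.
def fitAndSave(training: DataFrame, modelPath: String): PipelineModel = {
  // Map each raw ID to a small contiguous index, then expand it to a 0/1 feature vector.
  val indexer = new StringIndexer().setInputCol("thingId").setOutputCol("thingIndex")
  val encoder = new OneHotEncoder().setInputCol("thingIndex").setOutputCol("features")

  // LogisticRegression reads the default "features" and "label" columns.
  val lr = new LogisticRegression().setMaxIter(100).setRegParam(0.01)

  // The pipeline learns the ID -> index mapping and the coefficients together.
  val model = new Pipeline().setStages(Array(indexer, encoder, lr)).fit(training)

  // Persist the fitted pipeline (mapping included) so scoring jobs can reload it.
  model.write.overwrite().save(modelPath)
  model
}

A scoring job could then call PipelineModel.load(modelPath).transform(newData), provided it is willing to carry the Spark dependencies, as noted above.
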
>>>>> Honestly though, for that data size I'd certainly go with something like Liblinear :) Spark will ultimately scale better with the number of training examples for very large scale problems. However, there are definitely limitations on model dimension and sparse weight vectors currently. There are potential solutions to these, but they haven't been implemented as yet.
>>>>>
>>>>> On Fri, 11 Mar 2016 at 18:35 Daniel Siegmann <daniel.siegmann@teamaol.com> wrote:
>>>>>
>>>>>> On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath <nick.pentreath@gmail.com> wrote:
>>>>>>
>>>>>>> Would you mind letting us know the number of training examples in the datasets? Also, what do your features look like? Are they text, categorical, etc.? You mention that most rows only have a few features, and all rows together have a few 10,000s of features, yet your max feature value is 20 million. How are you constructing your feature vectors to get a size of 20 million? The only realistic way I can see this situation occurring in practice is with feature hashing (HashingTF).
>>>>>>
>>>>>> The sub-sample I'm currently training on is about 50K rows, so ... small.
>>>>>>
>>>>>> The features causing this issue are numeric (int) IDs for ... let's call it "Thing". For each Thing in the record, we set the feature Thing.id to a value of 1.0 in our vector (which is of course a SparseVector). I'm not sure how IDs are generated for Things, but they can be large numbers.
>>>>>>
>>>>>> The largest Thing ID is around 20 million, so that ends up being the size of the vector. But in fact there are fewer than 10,000 unique Thing IDs in this data. The mean number of features per record in what I'm currently training against is 41, while the maximum for any given record was 1754.
>>>>>>
>>>>>> It is possible to map the features into a small set (just need to zipWithIndex), but this is undesirable because of the added complexity (not just for the training, but also for anything wanting to score against the model). It might be a little easier if this could be encapsulated within the model object itself (perhaps via composition), though I'm not sure how feasible that is.
>>>>>>
>>>>>> But I'd rather not bother with dimensionality reduction at all - since we can train using liblinear in just a few minutes, it doesn't seem necessary.
>>>>>>
>>>>>>> MultivariateOnlineSummarizer uses dense arrays, but it should be possible to enable sparse data. Though in theory the result will tend to be dense anyway, unless you have very many entries in the input feature vector that never occur and are actually zero throughout the data set (which it seems is the case with your data?). So I doubt whether using sparse vectors for the summarizer would improve performance in general.
>>>>>>
>>>>>> Yes, that is exactly my case - the vast majority of entries in the input feature vector will *never* occur. Presumably that means most of the values in the aggregators' arrays will be zero.
>>>>>>
>>>>>>> LR doesn't accept a sparse weight vector, as it currently uses dense vectors for coefficients and gradients. When using L1 regularization it could support sparse weight vectors, but the current implementation doesn't do that yet.
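
For concreteness, a sketch of the zipWithIndex remapping mentioned above - assigning each distinct Thing ID a compact index so the vectors are under 10,000 wide instead of ~20 million. The input shape and names here are hypothetical:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical input: one (label, Thing IDs present in the record) pair per record.
def remapFeatures(records: RDD[(Double, Seq[Long])]): RDD[LabeledPoint] = {
  // The zipWithIndex step: give each distinct Thing ID a small contiguous index.
  val idToIndex: Map[Long, Int] =
    records.flatMap(_._2)
      .distinct()
      .zipWithIndex()
      .map { case (id, idx) => (id, idx.toInt) }
      .collectAsMap()
      .toMap

  // Vector size is now the number of distinct IDs, not the largest raw ID.
  val numFeatures = idToIndex.size

  records.map { case (label, thingIds) =>
    val indices = thingIds.map(idToIndex).distinct.sorted.toArray
    LabeledPoint(label, Vectors.sparse(numFeatures, indices, Array.fill(indices.length)(1.0)))
  }
}

The idToIndex map would have to be saved alongside the model and applied by anything scoring against it - which is exactly the added complexity objected to above. The resulting RDD[LabeledPoint] can be converted to a DataFrame for the ml LogisticRegression.
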
>>>>>> Good to know it is theoretically possible to implement. I'll have to give it some thought. In the meantime I guess I'll experiment with coalescing the data to minimize the communication overhead.
>>>>>>
>>>>>> Thanks again.
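
As a footnote on that last point, the coalescing experiment amounts to reducing the partition count before fitting; the count used here is illustrative and would need tuning to the cluster:

import org.apache.spark.sql.DataFrame

// Fewer partitions means fewer partial aggregates to combine in each treeAggregate
// round of LR training, at the cost of less parallelism per iteration.
def coalesceForTraining(training: DataFrame, numPartitions: Int = 16): DataFrame =
  training.coalesce(numPartitions)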