From Nick Pentreath <nick.pentre...@gmail.com>
Subject Re: Spark ML - Scaling logistic regression for many features
Date Fri, 18 Mar 2016 11:15:08 GMT
No, I didn't yet - feel free to create a JIRA.



On Thu, 17 Mar 2016 at 22:55 Daniel Siegmann <daniel.siegmann@teamaol.com>
wrote:

> Hi Nick,
>
> Thanks again for your help with this. Did you create a ticket in JIRA for
> investigating sparse models in LR and / or multivariate summariser? If so,
> can you give me the issue key(s)? If not, would you like me to create these
> tickets?
>
> I'm going to look into this some more and see if I can figure out how to
> implement these fixes.
>
> ~Daniel Siegmann
>
> On Sat, Mar 12, 2016 at 5:53 AM, Nick Pentreath <nick.pentreath@gmail.com>
> wrote:
>
>> Also adding dev list in case anyone else has ideas / views.
>>
>> On Sat, 12 Mar 2016 at 12:52, Nick Pentreath <nick.pentreath@gmail.com>
>> wrote:
>>
>>> Thanks for the feedback.
>>>
>>> I think Spark can certainly meet your use case when your data size
>>> scales up, as the actual model dimension is very small - you will need to
>>> use those indexers or some other mapping mechanism.
>>>
>>> There is ongoing work for Spark 2.0 to make it easier to use models
>>> outside of Spark - also see PMML export (I think mllib logistic
>>> regression is supported, but I have to check that). That will help with
>>> using Spark models in serving environments.
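>>>
>>> For mllib's binary LR, the export would look roughly like this (a
>>> minimal sketch, assuming the PMML support is there as I recall; the
>>> path is a placeholder):
>>>
>>>   import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
>>>
>>>   // training: RDD[LabeledPoint]
>>>   val model = new LogisticRegressionWithLBFGS().run(training)
>>>   // write PMML so a serving app only needs a PMML evaluator, not Spark
>>>   model.toPMML("/tmp/lr-model.pmml")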
>>>
>>> Finally, I will add a JIRA to investigate sparse models for LR - maybe
>>> also a ticket for multivariate summariser (though I don't think in practice
>>> there will be much to gain).
>>>
>>>
>>> On Fri, 11 Mar 2016 at 21:35, Daniel Siegmann <
>>> daniel.siegmann@teamaol.com> wrote:
>>>
>>>> Thanks for the pointer to those indexers; those are some good examples.
>>>> That seems a good way to go for the trainer and any scoring done in
>>>> Spark. I will definitely have to deal with scoring in non-Spark systems,
>>>> though.
>>>>
>>>> I think I will need to scale up beyond what single-node liblinear can
>>>> practically provide. The system will need to handle much larger sub-samples
>>>> of this data (and other projects might be larger still). Additionally, the
>>>> system needs to train many models in parallel (hyper-parameter optimization
>>>> with n-fold cross-validation, multiple algorithms, different sets of
>>>> features).
>>>>
>>>> Still, I suppose we'll have to consider whether Spark is the best
>>>> system for this. For now though, my job is to see what can be achieved with
>>>> Spark.
>>>>
>>>>
>>>>
>>>> On Fri, Mar 11, 2016 at 12:45 PM, Nick Pentreath <
>>>> nick.pentreath@gmail.com> wrote:
>>>>
>>>>> Ok, I think I understand things better now.
>>>>>
>>>>> For Spark's current implementation, you would need to map those
>>>>> features as you mention. You could also use say StringIndexer ->
>>>>> OneHotEncoder or VectorIndexer. You could create a Pipeline to deal with
>>>>> the mapping and training (e.g.
>>>>> http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline).
>>>>> Pipeline supports persistence.
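>>>>>
>>>>> Roughly, such a pipeline would look like this (a sketch; column names
>>>>> are placeholders):
>>>>>
>>>>>   import org.apache.spark.ml.Pipeline
>>>>>   import org.apache.spark.ml.classification.LogisticRegression
>>>>>   import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
>>>>>
>>>>>   // map raw ids to a compact index, then one-hot encode
>>>>>   val indexer = new StringIndexer()
>>>>>     .setInputCol("thingId").setOutputCol("thingIndex")
>>>>>   val encoder = new OneHotEncoder()
>>>>>     .setInputCol("thingIndex").setOutputCol("features")
>>>>>   val lr = new LogisticRegression()
>>>>>   val pipeline = new Pipeline().setStages(Array(indexer, encoder, lr))
>>>>>   val model = pipeline.fit(training)  // training: DataFrame
>>>>>   model.save("/tmp/lr-pipeline")      // persisted for later reuse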
>>>>>
>>>>> But it depends on your scoring use case too - a Spark pipeline can be
>>>>> saved and then reloaded, but you need all of Spark dependencies in your
>>>>> serving app which is often not ideal. If you're doing bulk scoring offline,
>>>>> then it may suit.
>>>>>
>>>>> Honestly though, for that data size I'd certainly go with something
>>>>> like Liblinear :) Spark will ultimately scale better with # training
>>>>> examples for very large scale problems. However there are definitely
>>>>> limitations on model dimension and sparse weight vectors currently. There
>>>>> are potential solutions to these, but they haven't been implemented
>>>>> as yet.
>>>>>
>>>>> On Fri, 11 Mar 2016 at 18:35 Daniel Siegmann <
>>>>> daniel.siegmann@teamaol.com> wrote:
>>>>>
>>>>>> On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath <
>>>>>> nick.pentreath@gmail.com> wrote:
>>>>>>
>>>>>>> Would you mind letting us know the # training examples in the
>>>>>>> datasets? Also, what do your features look like? Are they text,
>>>>>>> categorical, etc.? You mention that most rows only have a few
>>>>>>> features, and all rows together have a few 10,000s of features, yet
>>>>>>> your max feature value is 20 million. How are you constructing your
>>>>>>> feature vectors to get a 20 million size? The only realistic way I
>>>>>>> can see this situation occurring in practice is with feature hashing
>>>>>>> (HashingTF).
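>>>>>>>
>>>>>>> For reference, hashing would bound the vector size regardless of the
>>>>>>> raw id range - a minimal sketch (numFeatures chosen arbitrarily):
>>>>>>>
>>>>>>>   import org.apache.spark.ml.feature.HashingTF
>>>>>>>
>>>>>>>   // input column holds an array of "terms" (e.g. the ids as
>>>>>>>   // strings); output is a sparse vector of the fixed size below
>>>>>>>   val hashingTF = new HashingTF()
>>>>>>>     .setInputCol("things").setOutputCol("features")
>>>>>>>     .setNumFeatures(1 << 18)
>>>>>>>   val hashed = hashingTF.transform(df)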
>>>>>>>
>>>>>>
>>>>>> The sub-sample I'm currently training on is about 50K rows, so ...
>>>>>> small.
>>>>>>
>>>>>> The features causing this issue are numeric (int) IDs for ... let's
>>>>>> call it "Thing". For each Thing in the record, we set the feature
>>>>>> Thing.id to a value of 1.0 in our vector (which is of course a
>>>>>> SparseVector). I'm not sure how IDs are generated for Things, but
>>>>>> they can be large numbers.
>>>>>>
>>>>>> The largest Thing ID is around 20 million, so that ends up being the
>>>>>> size of the vector. But in fact there are fewer than 10,000 unique
>>>>>> Thing IDs in this data. The mean number of features per record in
>>>>>> what I'm currently training against is 41, while the maximum for any
>>>>>> given record was 1754.
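>>>>>>
>>>>>> In code, the construction is roughly this (a simplified sketch;
>>>>>> thingIds stands in for however we extract the ids from a record):
>>>>>>
>>>>>>   import org.apache.spark.mllib.linalg.Vectors
>>>>>>
>>>>>>   val indices = thingIds.distinct.sorted  // e.g. ~41 ids per record
>>>>>>   val values = Array.fill(indices.length)(1.0)
>>>>>>   // the size is driven by the largest possible ID (~20 million),
>>>>>>   // not by the ~10,000 ids that actually occur
>>>>>>   val features = Vectors.sparse(20000000, indices, values)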
>>>>>>
>>>>>> It is possible to map the features into a small set (just need to
>>>>>> zipWithIndex), but this is undesirable because of the added
>>>>>> complexity (not just for the training, but also anything wanting to
>>>>>> score against the model). It might be a little easier if this could
>>>>>> be encapsulated within the model object itself (perhaps via
>>>>>> composition), though I'm not sure how feasible that is.
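>>>>>>
>>>>>> The mapping itself is simple enough (a sketch, assuming an
>>>>>> RDD[Seq[Int]] of the ids per record):
>>>>>>
>>>>>>   // compact id -> index mapping built from the observed features
>>>>>>   val featureIndex: Map[Int, Int] = data
>>>>>>     .flatMap(identity).distinct()
>>>>>>     .zipWithIndex()
>>>>>>     .mapValues(_.toInt)
>>>>>>     .collectAsMap().toMap
>>>>>>   // the same mapping then has to ship alongside the model so that
>>>>>>   // scoring systems can build vectors with matching indices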
>>>>>>
>>>>>> But I'd rather not bother with dimensionality reduction at all -
>>>>>> since we can train using liblinear in just a few minutes, it doesn't
>>>>>> seem necessary.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> MultivariateOnlineSummarizer uses dense arrays, but it should be
>>>>>>> possible to enable sparse data. Though in theory, the result will
>>>>>>> tend to be dense anyway, unless you have very many entries in the
>>>>>>> input feature vector that never occur and are actually zero
>>>>>>> throughout the data set (which it seems is the case with your
>>>>>>> data?). So I doubt whether using sparse vectors for the summarizer
>>>>>>> would improve performance in general.
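>>>>>>>
>>>>>>> For context, the summary statistics are aggregated along these lines
>>>>>>> (a rough sketch of the pattern, not the exact internal code):
>>>>>>>
>>>>>>>   import org.apache.spark.mllib.linalg.Vector
>>>>>>>   import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
>>>>>>>
>>>>>>>   // each summarizer keeps dense arrays of length numFeatures, so a
>>>>>>>   // 20-million-dimension input means 20-million-entry arrays
>>>>>>>   val summary = features.treeAggregate(new MultivariateOnlineSummarizer)(
>>>>>>>     (summarizer, v: Vector) => summarizer.add(v),
>>>>>>>     (s1, s2) => s1.merge(s2))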
>>>>>>>
>>>>>>
>>>>>> Yes, that is exactly my case - the vast majority of entries in the
>>>>>> input feature vector will *never* occur. Presumably that means most
>>>>>> of the values in the aggregators' arrays will be zero.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> LR doesn't accept a sparse weight vector, as it uses dense vectors
>>>>>>> for coefficients and gradients currently. When using L1
>>>>>>> regularization, it could support sparse weight vectors, but the
>>>>>>> current implementation doesn't do that yet.
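>>>>>>>
>>>>>>> For context, L1 is selected via the elastic-net mixing parameter - a
>>>>>>> sketch, with an arbitrary regularization strength:
>>>>>>>
>>>>>>>   import org.apache.spark.ml.classification.LogisticRegression
>>>>>>>
>>>>>>>   // elasticNetParam = 1.0 means pure L1: many coefficients end up
>>>>>>>   // zero, but they are still stored in a dense vector today
>>>>>>>   val lr = new LogisticRegression()
>>>>>>>     .setRegParam(0.01)
>>>>>>>     .setElasticNetParam(1.0)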
>>>>>>>
>>>>>>
>>>>>> Good to know it is theoretically possible to implement. I'll have to
>>>>>> give it some thought. In the meantime I guess I'll experiment with
>>>>>> coalescing the data to minimize the communication overhead.
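>>>>>>
>>>>>> Something like this is what I have in mind (a sketch; the partition
>>>>>> count would need tuning):
>>>>>>
>>>>>>   // fewer partitions means fewer of the large, dense gradient
>>>>>>   // aggregates shuffled on each iteration
>>>>>>   val coalesced = training.coalesce(16)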
>>>>>>
>>>>>> Thanks again.
>>>>>>
>>>>>
>>>>
>
