spark-dev mailing list archives

From Dongjin Lee <dong...@apache.org>
Subject Re: Spark Local Pipelines
Date Mon, 13 Mar 2017 15:08:29 GMT
Although I love Asher's cool idea, I'd rather +1 Sean's view; I
think it would be much better for this to live outside of the project.

Best,
Dongjin

On Mon, Mar 13, 2017 at 5:39 PM, Sean Owen <sowen@cloudera.com> wrote:

> I'm skeptical.  Serving synchronous queries from a model at scale is a
> fundamentally different activity. As you note, it doesn't logically involve
> Spark. If it has to happen in milliseconds it's going to be in-core.
> Scoring even 10qps with a Spark job per request is probably a non-starter;
> think of the thousands of tasks per second and the overhead of just
> tracking them.
>
> When you say the RDDs support point prediction, I think you mean that
> those older models expose a method to score a Vector. They are not somehow
> exposing distributed point prediction. You could add this to the newer
> models, but it raises the question of how to make the Row to feed it. The
> .mllib API punts on this and assumes you can construct the Vector.
>
> I think this sweeps a lot under the rug in assuming that there can just be
> a "local" version of every Transformer -- but, even if there could be,
> consider how much extra implementation that is. Lots of them probably could
> be but I'm not sure that all can.
>
> The bigger problem in my experience is that Pipelines don't generally
> encapsulate the entire pipeline from source data to score. They encapsulate
> the part after computing underlying features. That is, if one of your
> features is "total clicks from this user", that's the product of a
> DataFrame operation that precedes a Pipeline. This can't be turned into a
> non-distributed, non-Spark local version.
>
> Solving subsets of this problem could still be useful, and you've
> highlighted some external projects that try. I'd also highlight PMML as an
> established interchange format for just the model part, and for cases that
> don't involve much or any pipeline, it's a better fit paired with a library
> that can score from PMML.
>
> I think this is one of those things that could live outside the project,
> because it's more not-Spark than Spark. Remember too that building a
> solution into the project blesses one at the expense of others.
>
>
> On Sun, Mar 12, 2017 at 10:15 PM Asher Krim <akrim@hubspot.com> wrote:
>
>> Hi All,
>>
>> I spent a lot of time at Spark Summit East this year talking with Spark
>> developers and committers about challenges with productizing Spark. One of
>> the biggest shortcomings I've encountered in Spark ML pipelines is the lack
>> of a way to serve single requests with any reasonable performance.
>> SPARK-10413 explores adding methods for single item prediction, but I'd
>> like to explore a more holistic approach - a separate local api, with
>> models that support transformations without depending on Spark at all.
>>
>> I've written up a doc
>> <https://docs.google.com/document/d/1Ha4DRMio5A7LjPqiHUnwVzbaxbev6ys04myyz6nDgI4/edit?usp=sharing>
>> detailing the approach, and I'm happy to discuss alternatives. If this
>> gains traction, I can create a branch with a minimal example on a simple
>> transformer (probably something like CountVectorizerModel) so we have
>> something concrete to continue the discussion on.
>>
>> Thanks,
>> Asher Krim
>> Senior Software Engineer
>>
>


-- 
*Dongjin Lee*


*Software developer in Line+.
So interested in massive-scale machine learning.

facebook: www.facebook.com/dongjin.lee.kr <http://www.facebook.com/dongjin.lee.kr>
linkedin: kr.linkedin.com/in/dongjinleekr <http://kr.linkedin.com/in/dongjinleekr>
github: github.com/dongjinleekr <http://github.com/dongjinleekr>
twitter: www.twitter.com/dongjinleekr <http://www.twitter.com/dongjinleekr>*
