spark-dev mailing list archives

From Holden Karau <hol...@pigscanfly.ca>
Subject Re: Revisiting Online serving of Spark models?
Date Tue, 29 May 2018 16:47:30 GMT
I'm down for that. We could all go for a walk, maybe to the Mint Plaza Blue
Bottle, and grab coffee (and if the weather holds, have our design meeting
outside :p)?

On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <felixcheung_m@hotmail.com>
wrote:

> Bump.
>
> ------------------------------
> *From:* Felix Cheung <felixcheung_m@hotmail.com>
> *Sent:* Saturday, May 26, 2018 1:05:29 PM
> *To:* Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
> *Cc:* Leif Walsh; Holden Karau; dev
>
> *Subject:* Re: Revisiting Online serving of Spark models?
>
> Hi! How about we meet the community and discuss on June 6 4pm at (near)
> the Summit?
>
> (I propose we meet at the venue entrance so we could accommodate people
> who might not be at the conference)
>
> ------------------------------
> *From:* Saikat Kanjilal <sxk1969@hotmail.com>
> *Sent:* Tuesday, May 22, 2018 7:47:07 AM
> *To:* Maximiliano Felice
> *Cc:* Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
> *Subject:* Re: Revisiting Online serving of Spark models?
>
> I’m in the same exact boat as Maximiliano and have use cases as well for
> model serving and would love to join this discussion.
>
> Sent from my iPhone
>
> On May 22, 2018, at 6:39 AM, Maximiliano Felice <
> maximilianofelice@gmail.com> wrote:
>
> Hi!
>
> I don't usually write a lot on this list, but I keep up to date with the
> discussions and I'm a heavy user of Spark. This topic caught my attention,
> as we're currently facing this issue at work. I'm attending the summit
> and was wondering if it would be possible for me to join that meeting. I
> might be able to share some helpful use cases and ideas.
>
> Thanks,
> Maximiliano Felice
>
> On Tue, May 22, 2018 at 9:14 AM, Leif Walsh <leif.walsh@gmail.com>
> wrote:
>
>> I’m with you on JSON being more readable than Parquet, but we’ve had
>> success using pyarrow’s Parquet reader and have been quite happy with it so
>> far. If your target is Python (and probably if not now, then soon, R), you
>> should look into it.
>>
>> On Mon, May 21, 2018 at 16:52 Joseph Bradley <joseph@databricks.com>
>> wrote:
>>
>>> Regarding model reading and writing, I'll give quick thoughts here:
>>> * Our approach was to use the same format but write JSON instead of
>>> Parquet.  It's easier to parse JSON without Spark, and using the same
>>> format simplifies architecture.  Plus, some people want to check files into
>>> version control, and JSON is nice for that.
>>> * The reader/writer APIs could be extended to take format parameters
>>> (just like DataFrame reader/writers) to handle JSON (and maybe, eventually,
>>> handle Parquet in the online serving setting).
>>>
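To make the format-parameter idea above concrete, here is a minimal sketch in plain Python with no Spark involved. The names `save_model` and `load_model` and the `fmt` argument are hypothetical illustrations, not MLlib APIs; only JSON is handled, and any other format is rejected rather than guessed at.

```python
import json
from typing import Any, Dict


def save_model(params: Dict[str, Any], path: str, fmt: str = "json") -> None:
    """Write model parameters to `path` (hypothetical helper, not a Spark API)."""
    if fmt != "json":
        # Other formats (e.g. Parquet) are deliberately left out of this sketch.
        raise ValueError(f"unsupported format: {fmt}")
    with open(path, "w", encoding="utf-8") as f:
        # Sorted keys and indentation keep the file diff-friendly for version control.
        json.dump(params, f, indent=2, sort_keys=True)
        f.write("\n")


def load_model(path: str, fmt: str = "json") -> Dict[str, Any]:
    """Read model parameters back without needing a SparkContext."""
    if fmt != "json":
        raise ValueError(f"unsupported format: {fmt}")
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```

Writing with sorted keys and indentation is one way to get the version-control friendliness mentioned above, since small parameter changes then show up as small diffs.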
>>> This would be a big project, so proposing a SPIP might be best.  If
>>> people are around at the Spark Summit, that could be a good time to meet up
>>> & then post notes back to the dev list.
>>>
>>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung
>>> <felixcheung_m@hotmail.com> wrote:
>>>
>>>> Specifically, I’d like to bring part of the discussion to Model and
>>>> PipelineModel, and the various ModelReader and SharedReadWrite
>>>> implementations that rely on SparkContext. This is a big blocker on
>>>> reusing trained models outside of Spark for online serving.
>>>>
>>>> What’s the next step? Would folks be interested in getting together to
>>>> discuss/get some feedback?
>>>>
>>>>
>>>> _____________________________
>>>> From: Felix Cheung <felixcheung_m@hotmail.com>
>>>> Sent: Thursday, May 10, 2018 10:10 AM
>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>> To: Holden Karau <holden@pigscanfly.ca>, Joseph Bradley <
>>>> joseph@databricks.com>
>>>> Cc: dev <dev@spark.apache.org>
>>>>
>>>>
>>>>
>>>> Huge +1 on this!
>>>>
>>>> ------------------------------
>>>> *From:* holden.karau@gmail.com <holden.karau@gmail.com> on behalf of
>>>> Holden Karau <holden@pigscanfly.ca>
>>>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>>>> *To:* Joseph Bradley
>>>> *Cc:* dev
>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>
>>>>
>>>>
>>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <joseph@databricks.com>
>>>> wrote:
>>>>
>>>>> Thanks for bringing this up Holden!  I'm a strong supporter of this.
>>>>>
>>>> Awesome! I'm glad other folks think something like this belongs in
>>>> Spark.
>>>>
>>>>> This was one of the original goals for mllib-local: to have local
>>>>> versions of MLlib models which could be deployed without the big Spark
>>>>> JARs and without a SparkContext or SparkSession.  There are related
>>>>> commercial offerings like this : ) but the overhead of maintaining those
>>>>> offerings is pretty high.  Building good APIs within MLlib to avoid
>>>>> copying logic across libraries will be well worth it.
>>>>>
>>>>> We've talked about this need at Databricks and have also been syncing
>>>>> with the creators of MLeap.  It'd be great to get this functionality
>>>>> into Spark itself.  Some thoughts:
>>>>> * It'd be valuable to have this go beyond adding transform() methods
>>>>> taking a Row to the current Models.  Instead, it would be ideal to have
>>>>> local, lightweight versions of models in mllib-local, outside of the
>>>>> main mllib package (for easier deployment with smaller & fewer
>>>>> dependencies).
>>>>> * Supporting Pipelines is important.  For this, it would be ideal to
>>>>> utilize elements of Spark SQL, particularly Rows and Types, which could
>>>>> be moved into a local sql package.
>>>>> * This architecture may require some awkward APIs currently to have
>>>>> model prediction logic in mllib-local, local model classes in mllib-local,
>>>>> and regular (DataFrame-friendly) model classes in mllib.  We might find
>>>>> it helpful to break some DeveloperApis in Spark 3.0 to facilitate this
>>>>> architecture while making it feasible for 3rd party developers to extend
>>>>> MLlib APIs (especially in Java).
>>>>>
>>>> I agree this could be interesting, and feed into the other discussion
>>>> around when (or if) we should be considering Spark 3.0.
>>>> I _think_ we could probably do it with optional traits people could mix
>>>> in to avoid breaking the current APIs, but I could be wrong on that point.
>>>>
>>>>> * It could also be worth discussing local DataFrames.  They might not
>>>>> be as important as per-Row transformations, but they would be helpful
>>>>> for batching for higher throughput.
>>>>>
>>>> That could be interesting as well.
>>>>
>>>>>
>>>>> I'll be interested to hear others' thoughts too!
>>>>>
>>>>> Joseph
>>>>>
>>>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <holden@pigscanfly.ca>
>>>>> wrote:
>>>>>
>>>>>> Hi y'all,
>>>>>>
>>>>>> With the renewed interest in ML in Apache Spark, now seems like as
>>>>>> good a time as any to revisit the online serving situation in Spark ML.
>>>>>> DB & others have done some excellent work moving a lot of the necessary
>>>>>> tools into a local linear algebra package that doesn't depend on having
>>>>>> a SparkContext.
>>>>>>
>>>>>> There are a few different commercial and non-commercial solutions
>>>>>> around this, but currently our individual transform/predict methods are
>>>>>> private, so they either need to copy or re-implement them (or put
>>>>>> themselves in org.apache.spark) to access them. How would folks feel
>>>>>> about adding a new trait for ML pipeline stages that exposes
>>>>>> transformation of single-element inputs (or local collections) and could
>>>>>> be optionally implemented by stages which support this? That way we can
>>>>>> have less copy-and-paste code possibly getting out of sync with our model
>>>>>> training.
>>>>>>
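A rough sketch of what such an optional interface could look like, written in Python for brevity rather than as a Scala trait. Everything here is hypothetical: `LocalTransformer`, `transform_row`, `transform_local`, `LocalLinearModel`, and `serve` are illustrative names only and do not exist in Spark. A record is modelled as a plain dict standing in for a Row.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

Row = Dict[str, Any]  # stand-in for a single input record (assumption)


class LocalTransformer(ABC):
    """Hypothetical optional interface a pipeline stage may implement."""

    @abstractmethod
    def transform_row(self, row: Row) -> Row:
        """Transform one element without any cluster context."""

    def transform_local(self, rows: List[Row]) -> List[Row]:
        # Default batch path for local collections: apply the single-row logic.
        return [self.transform_row(r) for r in rows]


class LocalLinearModel(LocalTransformer):
    """Toy stage: adds a linear prediction computed from named features."""

    def __init__(self, weights: Dict[str, float], intercept: float) -> None:
        self.weights = weights
        self.intercept = intercept

    def transform_row(self, row: Row) -> Row:
        score = self.intercept + sum(w * row[name] for name, w in self.weights.items())
        return {**row, "prediction": score}


def serve(stages: List[Any], row: Row) -> Row:
    """Run a row through the stages, rejecting any stage without local support."""
    for stage in stages:
        if not isinstance(stage, LocalTransformer):
            raise TypeError(f"{type(stage).__name__} does not support local transform")
        row = stage.transform_row(row)
    return row
```

Because the interface is opt-in, a serving tool can check each stage up front and fail fast on stages that only support the distributed path, instead of copying their prediction logic.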
>>>>>> I think continuing to have on-line serving grow in different projects
>>>>>> is probably the right path forward (folks have different needs), but I'd
>>>>>> love to see us make it simpler for other projects to build reliable
>>>>>> serving tools.
>>>>>>
>>>>>> I realize this maybe puts some of the folks in an awkward position
>>>>>> with their own commercial offerings, but hopefully if we make it easier
>>>>>> for everyone the commercial vendors can benefit as well.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Holden :)
>>>>>>
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Joseph Bradley
>>>>>
>>>>> Software Engineer - Machine Learning
>>>>>
>>>>> Databricks, Inc.
>>>>>
>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> Joseph Bradley
>>>
>>> Software Engineer - Machine Learning
>>>
>>> Databricks, Inc.
>>>
>>> [image: http://databricks.com] <http://databricks.com/>
>>>
>> --
>> Cheers,
>> Leif
>>
>


-- 
Twitter: https://twitter.com/holdenkarau
