spark-dev mailing list archives

From Maximiliano Felice <maximilianofel...@gmail.com>
Subject Re: Revisiting Online serving of Spark models?
Date Tue, 29 May 2018 17:35:11 GMT
Big +1 to a meeting with fresh air.

Could anyone send the invites? I don't really know which place Holden is
talking about.

2018-05-29 14:27 GMT-03:00 Felix Cheung <felixcheung_m@hotmail.com>:

> You had me at blue bottle!
>
> _____________________________
> From: Holden Karau <holden@pigscanfly.ca>
> Sent: Tuesday, May 29, 2018 9:47 AM
> Subject: Re: Revisiting Online serving of Spark models?
> To: Felix Cheung <felixcheung_m@hotmail.com>
> Cc: Saikat Kanjilal <sxk1969@hotmail.com>, Maximiliano Felice <
> maximilianofelice@gmail.com>, Joseph Bradley <joseph@databricks.com>,
> Leif Walsh <leif.walsh@gmail.com>, dev <dev@spark.apache.org>
>
>
>
> I'm down for that. We could all go for a walk, maybe to the Mint Plaza
> Blue Bottle, and grab coffee (and, if the weather holds, have our design
> meeting outside :p)?
>
> On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <felixcheung_m@hotmail.com>
> wrote:
>
>> Bump.
>>
>> ------------------------------
>> *From:* Felix Cheung <felixcheung_m@hotmail.com>
>> *Sent:* Saturday, May 26, 2018 1:05:29 PM
>> *To:* Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
>> *Cc:* Leif Walsh; Holden Karau; dev
>>
>> *Subject:* Re: Revisiting Online serving of Spark models?
>>
>> Hi! How about we meet with the community to discuss this on June 6 at 4pm,
>> at (or near) the Summit?
>>
>> (I propose we meet at the venue entrance so that we can accommodate people
>> who might not be attending the conference.)
>>
>> ------------------------------
>> *From:* Saikat Kanjilal <sxk1969@hotmail.com>
>> *Sent:* Tuesday, May 22, 2018 7:47:07 AM
>> *To:* Maximiliano Felice
>> *Cc:* Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
>> *Subject:* Re: Revisiting Online serving of Spark models?
>>
>> I’m in the same exact boat as Maximiliano and have use cases as well for
>> model serving and would love to join this discussion.
>>
>> Sent from my iPhone
>>
>> On May 22, 2018, at 6:39 AM, Maximiliano Felice <
>> maximilianofelice@gmail.com> wrote:
>>
>> Hi!
>>
>> I don't usually write a lot on this list, but I keep up to date with the
>> discussions and I'm a heavy user of Spark. This topic caught my attention,
>> as we're currently facing this issue at work. I'm attending the summit
>> and was wondering if it would be possible for me to join that meeting. I
>> might be able to share some helpful use cases and ideas.
>>
>> Thanks,
>> Maximiliano Felice
>>
>> El mar., 22 de may. de 2018 9:14 AM, Leif Walsh <leif.walsh@gmail.com>
>> escribió:
>>
>>> I’m with you on JSON being more readable than Parquet, but we’ve had
>>> success using pyarrow’s Parquet reader and have been quite happy with it so
>>> far. If your target is Python (and probably, if not now, then soon, R), you
>>> should look into it.
>>>
>>> On Mon, May 21, 2018 at 16:52 Joseph Bradley <joseph@databricks.com>
>>> wrote:
>>>
>>>> Regarding model reading and writing, I'll give quick thoughts here:
>>>> * Our approach was to use the same format but write JSON instead of
>>>> Parquet.  It's easier to parse JSON without Spark, and using the same
>>>> format simplifies architecture.  Plus, some people want to check files into
>>>> version control, and JSON is nice for that.
>>>> * The reader/writer APIs could be extended to take format parameters
>>>> (just like DataFrame reader/writers) to handle JSON (and maybe, eventually,
>>>> handle Parquet in the online serving setting).
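
As a minimal sketch of the JSON idea above (not code from Spark or from this thread): the names `LocalLinearModel`, `save_model`, `load_model`, and `round_trip` are hypothetical, and the `format` argument only mimics the DataFrame reader/writer style mentioned in the second bullet. It assumes a model whose state is a coefficient vector and an intercept, and shows that such state could be read back with nothing but a JSON parser.

```python
import json
import os
import tempfile


class LocalLinearModel:
    """Hypothetical stand-in for a trained linear model (not a Spark class)."""

    def __init__(self, coefficients, intercept):
        self.coefficients = [float(c) for c in coefficients]
        self.intercept = float(intercept)

    def predict(self, features):
        # Plain dot product plus intercept; no SparkContext involved.
        return sum(c * x for c, x in zip(self.coefficients, features)) + self.intercept


def save_model(model, path, format="json"):
    """Write model parameters to `path`. Only "json" is sketched here."""
    if format != "json":
        raise ValueError("unsupported format: " + format)
    payload = {
        "class": "LocalLinearModel",
        "coefficients": model.coefficients,
        "intercept": model.intercept,
    }
    with open(path, "w", encoding="utf-8") as f:
        # Sorted keys and indentation keep diffs small under version control.
        json.dump(payload, f, indent=2, sort_keys=True)


def load_model(path, format="json"):
    """Read parameters written by save_model, using only the standard library."""
    if format != "json":
        raise ValueError("unsupported format: " + format)
    with open(path, encoding="utf-8") as f:
        payload = json.load(f)
    return LocalLinearModel(payload["coefficients"], payload["intercept"])


def round_trip(model):
    """Save to a temporary directory and load back, returning the copy."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "model.json")
        save_model(model, path)
        return load_model(path)
```

A Parquet branch could be added behind the same `format` switch later; it is left out here because the thread only proposes it as a possibility.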
>>>>
>>>> This would be a big project, so proposing a SPIP might be best.  If
>>>> people are around at the Spark Summit, that could be a good time to meet
>>>> up & then post notes back to the dev list.
>>>>
>>>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <
>>>> felixcheung_m@hotmail.com> wrote:
>>>>
>>>>> Specifically, I’d like to bring part of the discussion to Model and
>>>>> PipelineModel, and the various ModelReader and SharedReadWrite
>>>>> implementations that rely on SparkContext. This is a big blocker to
>>>>> reusing trained models outside of Spark for online serving.
>>>>>
>>>>> What’s the next step? Would folks be interested in getting together to
>>>>> discuss/get some feedback?
>>>>>
>>>>>
>>>>> _____________________________
>>>>> From: Felix Cheung <felixcheung_m@hotmail.com>
>>>>> Sent: Thursday, May 10, 2018 10:10 AM
>>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>>> To: Holden Karau <holden@pigscanfly.ca>, Joseph Bradley <
>>>>> joseph@databricks.com>
>>>>> Cc: dev <dev@spark.apache.org>
>>>>>
>>>>>
>>>>>
>>>>> Huge +1 on this!
>>>>>
>>>>> ------------------------------
>>>>> *From:* holden.karau@gmail.com <holden.karau@gmail.com> on behalf of
>>>>> Holden Karau <holden@pigscanfly.ca>
>>>>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>>>>> *To:* Joseph Bradley
>>>>> *Cc:* dev
>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>
>>>>>
>>>>>
>>>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley
>>>>> <joseph@databricks.com> wrote:
>>>>>
>>>>>> Thanks for bringing this up Holden!  I'm a strong supporter of this.
>>>>>>
>>>>> Awesome! I'm glad other folks think something like this belongs in
>>>>> Spark.
>>>>>
>>>>>> This was one of the original goals for mllib-local: to have local
>>>>>> versions of MLlib models which could be deployed without the big Spark
>>>>>> JARs and without a SparkContext or SparkSession.  There are related
>>>>>> commercial offerings like this : ) but the overhead of maintaining those
>>>>>> offerings is pretty high.  Building good APIs within MLlib to avoid
>>>>>> copying logic across libraries will be well worth it.
>>>>>>
>>>>>> We've talked about this need at Databricks and have also been syncing
>>>>>> with the creators of MLeap.  It'd be great to get this functionality into
>>>>>> Spark itself.  Some thoughts:
>>>>>> * It'd be valuable to have this go beyond adding transform() methods
>>>>>> taking a Row to the current Models.  Instead, it would be ideal to have
>>>>>> local, lightweight versions of models in mllib-local, outside of the main
>>>>>> mllib package (for easier deployment with smaller & fewer dependencies).
>>>>>> * Supporting Pipelines is important.  For this, it would be ideal to
>>>>>> utilize elements of Spark SQL, particularly Rows and Types, which could be
>>>>>> moved into a local sql package.
>>>>>> * Currently, this architecture may require some awkward APIs in order to
>>>>>> have model prediction logic in mllib-local, local model classes in
>>>>>> mllib-local, and regular (DataFrame-friendly) model classes in mllib.  We
>>>>>> might find it helpful to break some DeveloperApis in Spark 3.0 to
>>>>>> facilitate this architecture while making it feasible for 3rd party
>>>>>> developers to extend MLlib APIs (especially in Java).
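
One way to picture the layering described in the bullets above is sketched below. This is an illustration under assumptions, not the design proposed in the thread: all class and function names are hypothetical, Python stands in for Scala, and a list of dicts plays the role of a DataFrame so the sketch runs without Spark. The point is only that the prediction logic is written once and both model classes delegate to it.

```python
def predict_linear(coefficients, intercept, features):
    """Shared prediction logic; in the proposal this would live in mllib-local."""
    return sum(c * x for c, x in zip(coefficients, features)) + intercept


class LocalLinearRegressionModel:
    """Hypothetical lightweight model: plain Python data, no Spark dependency."""

    def __init__(self, coefficients, intercept):
        self.coefficients = list(coefficients)
        self.intercept = intercept

    def predict(self, features):
        return predict_linear(self.coefficients, self.intercept, features)


class DataFrameLinearRegressionModel:
    """Hypothetical stand-in for the regular, DataFrame-friendly model.

    A list of dicts plays the role of a DataFrame here.
    """

    def __init__(self, coefficients, intercept, features_col="features",
                 prediction_col="prediction"):
        self._local = LocalLinearRegressionModel(coefficients, intercept)
        self.features_col = features_col
        self.prediction_col = prediction_col

    def to_local(self):
        # Hand out the lightweight model for deployment outside the cluster.
        return self._local

    def transform(self, rows):
        # Delegate to the local model so both paths share one implementation.
        return [
            {**row, self.prediction_col: self._local.predict(row[self.features_col])}
            for row in rows
        ]
```

Because `transform` and `to_local().predict` go through the same function, the batch and serving paths cannot drift apart in this sketch.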
>>>>>>
>>>>> I agree this could be interesting, and feed into the other discussion
>>>>> around when (or if) we should be considering Spark 3.0.
>>>>> I _think_ we could probably do it with optional traits people could
>>>>> mix in to avoid breaking the current APIs but I could be wrong on that
>>>>> point.
>>>>>
>>>>>> * It could also be worth discussing local DataFrames.  They might not
>>>>>> be as important as per-Row transformations, but they would be helpful for
>>>>>> batching for higher throughput.
>>>>>>
>>>>> That could be interesting as well.
>>>>>
>>>>>>
>>>>>> I'll be interested to hear others' thoughts too!
>>>>>>
>>>>>> Joseph
>>>>>>
>>>>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <holden@pigscanfly.ca>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi y'all,
>>>>>>>
>>>>>>> With the renewed interest in ML in Apache Spark, now seems like as good
>>>>>>> a time as any to revisit the online serving situation in Spark ML. DB &
>>>>>>> others have done some excellent work moving a lot of the necessary tools
>>>>>>> into a local linear algebra package that doesn't depend on having a
>>>>>>> SparkContext.
>>>>>>>
>>>>>>> There are a few different commercial and non-commercial solutions
>>>>>>> around this, but currently our individual transform/predict methods are
>>>>>>> private, so they either need to copy or re-implement them (or put
>>>>>>> themselves in org.apache.spark) to access them. How would folks feel
>>>>>>> about adding a new trait for ML pipeline stages that exposes
>>>>>>> transformation of single-element inputs (or local collections), and that
>>>>>>> could be optionally implemented by stages which support this? That way we
>>>>>>> can have less copy-and-paste code possibly getting out of sync with our
>>>>>>> model training.
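
The optional-trait idea in the paragraph above could look roughly like the following. This is a hedged sketch only: Python's abstract base class stands in for a Scala trait, and `LocalTransformer`, `transform_one`, `transform_local`, `Scaler`, `ClusterOnlyStage`, and `serve` are hypothetical names that do not exist in Spark. Dicts stand in for single-element inputs.

```python
from abc import ABC, abstractmethod


class LocalTransformer(ABC):
    """Hypothetical optional mix-in for stages that can transform one element."""

    @abstractmethod
    def transform_one(self, row):
        """Return a new dict for a single input dict."""

    def transform_local(self, rows):
        # Default for local collections: apply the single-element path to each.
        return [self.transform_one(row) for row in rows]


class Scaler(LocalTransformer):
    """Example stage that opts in to local serving."""

    def __init__(self, input_col, output_col, factor):
        self.input_col = input_col
        self.output_col = output_col
        self.factor = factor

    def transform_one(self, row):
        return {**row, self.output_col: row[self.input_col] * self.factor}


class ClusterOnlyStage:
    """Example stage that does not implement the mix-in."""


def serve(stages, row):
    """Run one element through the stages, failing fast on unsupported ones."""
    for stage in stages:
        if not isinstance(stage, LocalTransformer):
            raise TypeError(type(stage).__name__ + " does not support local transforms")
        row = stage.transform_one(row)
    return row
```

Stages that never implement the mix-in are unaffected, which is the property that would let existing APIs stay unbroken; a serving tool can check for the mix-in up front, as `serve` does, instead of copying private prediction code.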
>>>>>>>
>>>>>>> I think continuing to have online serving grow in different projects
>>>>>>> is probably the right path forward (folks have different needs), but I'd
>>>>>>> love to see us make it simpler for other projects to build reliable
>>>>>>> serving tools.
>>>>>>>
>>>>>>> I realize this maybe puts some of the folks in an awkward position
>>>>>>> with their own commercial offerings, but hopefully if we make it easier
>>>>>>> for everyone the commercial vendors can benefit as well.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Holden :)
>>>>>>>
>>>>>>> --
>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Joseph Bradley
>>>>>>
>>>>>> Software Engineer - Machine Learning
>>>>>>
>>>>>> Databricks, Inc.
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Joseph Bradley
>>>>
>>>> Software Engineer - Machine Learning
>>>>
>>>> Databricks, Inc.
>>>>
>>>>
>>> --
>>> Cheers,
>>> Leif
>>>
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>
>
>
