spark-dev mailing list archives

From Maximiliano Felice <maximilianofel...@gmail.com>
Subject Re: Revisiting Online serving of Spark models?
Date Wed, 06 Jun 2018 21:42:48 GMT
Hi!

Do we meet at the entrance?

See you

On Tue, Jun 5, 2018 at 3:07 PM, Nick Pentreath <nick.pentreath@gmail.com>
wrote:

> I will aim to join up at 4pm tomorrow (Wed) too. Look forward to it.
>
> On Sun, 3 Jun 2018 at 00:24 Holden Karau <holden@pigscanfly.ca> wrote:
>
>> On Sat, Jun 2, 2018 at 8:39 PM, Maximiliano Felice <
>> maximilianofelice@gmail.com> wrote:
>>
>>> Hi!
>>>
>>> We're already in San Francisco waiting for the summit. We even think
>>> that we spotted @holdenk this afternoon.
>>>
>> Unless you happened to be walking by my garage, probably not super
>> likely; I spent the day working on scooters/motorcycles (my style is a
>> little less unique in SF :)). Also, if you see me, feel free to say hi,
>> unless I look like I haven't had my first coffee of the day; I love
>> chatting with folks IRL :)
>>
>>>
>>> @chris, we're really interested in the Meetup you're hosting. My team
>>> will probably join from the beginning if you have room for us, and I'll
>>> join later after discussing the topics on this thread. I'll send you an
>>> email regarding this request.
>>>
>>> Thanks
>>>
>>> On Fri, Jun 1, 2018 at 7:26 AM, Saikat Kanjilal <sxk1969@hotmail.com>
>>> wrote:
>>>
>>>> @Chris This sounds fantastic, please send summary notes for Seattle
>>>> folks
>>>>
>>>> @Felix I work in downtown Seattle, and am wondering if we should host
>>>> a tech meetup around model serving in Spark at my work or somewhere
>>>> close by. Thoughts? I'm actually in the midst of building microservices
>>>> to manage models, and when I say models I mean much more than machine
>>>> learning models (think OR and process models as well).
>>>>
>>>> Regards
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On May 31, 2018, at 10:32 PM, Chris Fregly <chris@fregly.com> wrote:
>>>>
>>>> Hey everyone!
>>>>
>>>> @Felix:  thanks for putting this together.  i sent some of you a quick
>>>> calendar event - mostly for me, so i don’t forget!  :)
>>>>
>>>> Coincidentally, this is the focus of June 6th's *Advanced Spark and
>>>> TensorFlow Meetup*
>>>> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/>
>>>> @5:30pm on June 6th (same night) here in SF!
>>>>
>>>> Everybody is welcome to come.  Here’s the link to the meetup that
>>>> includes the signup link:
>>>> *https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/*
>>>> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/250924195/>
>>>>
>>>> We have an awesome lineup of speakers covering a lot of deep,
>>>> technical ground.
>>>>
>>>> For those who can’t attend in person, we’ll be broadcasting live - and
>>>> posting the recording afterward.
>>>>
>>>> All details are in the meetup link above…
>>>>
>>>> @holden/felix/nick/joseph/maximiliano/saikat/leif:  you’re more than
>>>> welcome to give a talk. I can move things around to make room.
>>>>
>>>> @joseph:  I’d personally like an update on the direction of the
>>>> Databricks proprietary ML Serving export format which is similar to PMML
>>>> but not a standard in any way.
>>>>
>>>> Also, the Databricks ML Serving Runtime is only available to Databricks
>>>> customers.  This seems in conflict with the community efforts described
>>>> here.  Can you comment on behalf of Databricks?
>>>>
>>>> Look forward to your response, joseph.
>>>>
>>>> See you all soon!
>>>>
>>>> —
>>>>
>>>>
>>>> *Chris Fregly *Founder @ *PipelineAI* <https://pipeline.ai/> (100,000
>>>> Users)
>>>> Organizer @ *Advanced Spark and TensorFlow Meetup*
>>>> <https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/> (85,000
>>>> Global Members)
>>>>
>>>>
>>>>
>>>> *San Francisco - Chicago - Austin -
>>>> Washington DC - London - Dusseldorf *
>>>> *Try our PipelineAI Community Edition with GPUs and TPUs!!
>>>> <http://community.pipeline.ai/>*
>>>>
>>>>
>>>> On May 30, 2018, at 9:32 AM, Felix Cheung <felixcheung_m@hotmail.com>
>>>> wrote:
>>>>
>>>> Hi!
>>>>
>>>> Thank you! Let’s meet then
>>>>
>>>> June 6 4pm
>>>>
>>>> Moscone West Convention Center
>>>> 800 Howard Street, San Francisco, CA 94103
>>>> <https://maps.google.com/?q=800+Howard+Street,+San+Francisco,+CA+94103&entry=gmail&source=g>
>>>>
>>>> Ground floor (outside of conference area - should be available for all)
>>>> - we will meet and decide where to go
>>>>
>>>> (Would not send invite because that would be too much noise for dev@)
>>>>
>>>> To paraphrase Joseph, we will use this to kick off the discussion and
>>>> post notes after and follow up online. As for Seattle, I would be very
>>>> interested to meet in person later and discuss ;)
>>>>
>>>>
>>>> _____________________________
>>>> From: Saikat Kanjilal <sxk1969@hotmail.com>
>>>> Sent: Tuesday, May 29, 2018 11:46 AM
>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>> To: Maximiliano Felice <maximilianofelice@gmail.com>
>>>> Cc: Felix Cheung <felixcheung_m@hotmail.com>, Holden Karau <
>>>> holden@pigscanfly.ca>, Joseph Bradley <joseph@databricks.com>, Leif
>>>> Walsh <leif.walsh@gmail.com>, dev <dev@spark.apache.org>
>>>>
>>>>
>>>> Would love to join but am in Seattle, thoughts on how to make this
>>>> work?
>>>>
>>>> Regards
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On May 29, 2018, at 10:35 AM, Maximiliano Felice <
>>>> maximilianofelice@gmail.com> wrote:
>>>>
>>>> Big +1 to a meeting with fresh air.
>>>>
>>>> Could anyone send the invites? I don't really know which is the place
>>>> Holden is talking about.
>>>>
>>>> 2018-05-29 14:27 GMT-03:00 Felix Cheung <felixcheung_m@hotmail.com>:
>>>>
>>>>> You had me at blue bottle!
>>>>>
>>>>> _____________________________
>>>>> From: Holden Karau <holden@pigscanfly.ca>
>>>>> Sent: Tuesday, May 29, 2018 9:47 AM
>>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>>> To: Felix Cheung <felixcheung_m@hotmail.com>
>>>>> Cc: Saikat Kanjilal <sxk1969@hotmail.com>, Maximiliano Felice <
>>>>> maximilianofelice@gmail.com>, Joseph Bradley <joseph@databricks.com>,
>>>>> Leif Walsh <leif.walsh@gmail.com>, dev <dev@spark.apache.org>
>>>>>
>>>>>
>>>>>
>>>>> I'm down for that, we could all go for a walk, maybe to the Mint
>>>>> Plaza Blue Bottle, and grab coffee (if the weather holds, have our
>>>>> design meeting outside :p)?
>>>>>
>>>>> On Tue, May 29, 2018 at 9:37 AM, Felix Cheung <
>>>>> felixcheung_m@hotmail.com> wrote:
>>>>>
>>>>>> Bump.
>>>>>>
>>>>>> ------------------------------
>>>>>> *From:* Felix Cheung <felixcheung_m@hotmail.com>
>>>>>> *Sent:* Saturday, May 26, 2018 1:05:29 PM
>>>>>> *To:* Saikat Kanjilal; Maximiliano Felice; Joseph Bradley
>>>>>> *Cc:* Leif Walsh; Holden Karau; dev
>>>>>>
>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>>
>>>>>> Hi! How about we meet with the community and discuss on June 6 at
>>>>>> 4pm at (or near) the Summit?
>>>>>>
>>>>>> (I propose we meet at the venue entrance so we can accommodate
>>>>>> people who might not be in the conference)
>>>>>>
>>>>>> ------------------------------
>>>>>> *From:* Saikat Kanjilal <sxk1969@hotmail.com>
>>>>>> *Sent:* Tuesday, May 22, 2018 7:47:07 AM
>>>>>> *To:* Maximiliano Felice
>>>>>> *Cc:* Leif Walsh; Felix Cheung; Holden Karau; Joseph Bradley; dev
>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>>
>>>>>> I'm in the same exact boat as Maximiliano and have use cases as
>>>>>> well for model serving and would love to join this discussion.
>>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> On May 22, 2018, at 6:39 AM, Maximiliano Felice <
>>>>>> maximilianofelice@gmail.com> wrote:
>>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> I don't usually write a lot on this list, but I keep up to date with
>>>>>> the discussions and I'm a heavy user of Spark. This topic caught my
>>>>>> attention, as we're currently facing this issue at work. I'm
>>>>>> attending the summit and was wondering if it would be possible for me
>>>>>> to join that meeting. I might be able to share some helpful use cases
>>>>>> and ideas.
>>>>>>
>>>>>> Thanks,
>>>>>> Maximiliano Felice
>>>>>>
>>>>>> On Tue, May 22, 2018 at 9:14 AM, Leif Walsh <leif.walsh@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I'm with you on json being more readable than parquet, but we've
>>>>>>> had success using pyarrow's parquet reader and have been quite happy
>>>>>>> with it so far. If your target is python (and probably if not now,
>>>>>>> then soon, R), you should look in to it.
>>>>>>>
>>>>>>> On Mon, May 21, 2018 at 16:52 Joseph Bradley <joseph@databricks.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Regarding model reading and writing, I'll give quick thoughts here:
>>>>>>>> * Our approach was to use the same format but write JSON instead of
>>>>>>>> Parquet.  It's easier to parse JSON without Spark, and using the
>>>>>>>> same format simplifies architecture.  Plus, some people want to
>>>>>>>> check files into version control, and JSON is nice for that.
>>>>>>>> * The reader/writer APIs could be extended to take format
>>>>>>>> parameters (just like DataFrame reader/writers) to handle JSON (and
>>>>>>>> maybe, eventually, handle Parquet in the online serving setting).
>>>>>>>>
>>>>>>>> This would be a big project, so proposing a SPIP might be best.
>>>>>>>> If people are around at the Spark Summit, that could be a good time
>>>>>>>> to meet up & then post notes back to the dev list.
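[Editor's note: a hedged sketch of Joseph's JSON-instead-of-Parquet point. The metadata layout below is invented for illustration and is not MLlib's actual on-disk format.]

```python
# Sketch: persisting model params as JSON so they can be parsed without
# Spark and diffed in version control. The layout here is invented; it is
# NOT MLlib's actual on-disk metadata format.
import json

metadata = {
    "class": "org.apache.spark.ml.classification.LogisticRegressionModel",
    "paramMap": {"regParam": 0.01, "maxIter": 100},
}

# JSON round-trips with the standard library alone -- no SparkContext.
serialized = json.dumps(metadata, sort_keys=True, indent=2)
restored = json.loads(serialized)
print(restored["paramMap"]["regParam"])
```

Because the serialized form is stable, human-readable text, it diffs cleanly in version control, which is the property Joseph highlights above.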
>>>>>>>>
>>>>>>>> On Sun, May 20, 2018 at 8:11 PM, Felix Cheung <
>>>>>>>> felixcheung_m@hotmail.com> wrote:
>>>>>>>>
>>>>>>>>> Specifically I'd like to bring part of the discussion to Model
>>>>>>>>> and PipelineModel, and various ModelReader and SharedReadWrite
>>>>>>>>> implementations that rely on SparkContext. This is a big blocker
>>>>>>>>> on reusing trained models outside of Spark for online serving.
>>>>>>>>>
>>>>>>>>> What's the next step? Would folks be interested in getting
>>>>>>>>> together to discuss/get some feedback?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> _____________________________
>>>>>>>>> From: Felix Cheung <felixcheung_m@hotmail.com>
>>>>>>>>> Sent: Thursday, May 10, 2018 10:10 AM
>>>>>>>>> Subject: Re: Revisiting Online serving of Spark models?
>>>>>>>>> To: Holden Karau <holden@pigscanfly.ca>, Joseph Bradley <
>>>>>>>>> joseph@databricks.com>
>>>>>>>>> Cc: dev <dev@spark.apache.org>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Huge +1 on this!
>>>>>>>>>
>>>>>>>>> ------------------------------
>>>>>>>>> *From:* holden.karau@gmail.com <holden.karau@gmail.com> on behalf
>>>>>>>>> of Holden Karau <holden@pigscanfly.ca>
>>>>>>>>> *Sent:* Thursday, May 10, 2018 9:39:26 AM
>>>>>>>>> *To:* Joseph Bradley
>>>>>>>>> *Cc:* dev
>>>>>>>>> *Subject:* Re: Revisiting Online serving of Spark models?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley <
>>>>>>>>> joseph@databricks.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for bringing this up Holden!  I'm a strong supporter of
>>>>>>>>>> this.
>>>>>>>>>>
>>>>>>>>> Awesome! I'm glad other folks think something like this belongs
>>>>>>>>> in Spark.
>>>>>>>>>
>>>>>>>>>> This was one of the original goals for mllib-local: to have
>>>>>>>>>> local versions of MLlib models which could be deployed without
>>>>>>>>>> the big Spark JARs and without a SparkContext or SparkSession.
>>>>>>>>>> There are related commercial offerings like this : ) but the
>>>>>>>>>> overhead of maintaining those offerings is pretty high.  Building
>>>>>>>>>> good APIs within MLlib to avoid copying logic across libraries
>>>>>>>>>> will be well worth it.
>>>>>>>>>>
>>>>>>>>>> We've talked about this need at Databricks and have also been
>>>>>>>>>> syncing with the creators of MLeap.  It'd be great to get this
>>>>>>>>>> functionality into Spark itself.  Some thoughts:
>>>>>>>>>> * It'd be valuable to have this go beyond adding transform()
>>>>>>>>>> methods taking a Row to the current Models.  Instead, it would be
>>>>>>>>>> ideal to have local, lightweight versions of models in
>>>>>>>>>> mllib-local, outside of the main mllib package (for easier
>>>>>>>>>> deployment with smaller & fewer dependencies).
>>>>>>>>>> * Supporting Pipelines is important.  For this, it would be
>>>>>>>>>> ideal to utilize elements of Spark SQL, particularly Rows and
>>>>>>>>>> Types, which could be moved into a local sql package.
>>>>>>>>>> * This architecture may require some awkward APIs currently to
>>>>>>>>>> have model prediction logic in mllib-local, local model classes
>>>>>>>>>> in mllib-local, and regular (DataFrame-friendly) model classes in
>>>>>>>>>> mllib.  We might find it helpful to break some DeveloperApis in
>>>>>>>>>> Spark 3.0 to facilitate this architecture while making it
>>>>>>>>>> feasible for 3rd party developers to extend MLlib APIs
>>>>>>>>>> (especially in Java).
>>>>>>>>>>
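[Editor's note: as a toy illustration of the "local, lightweight model" idea Joseph describes above. The class and method names are invented here; this is not the mllib-local API.]

```python
# Toy illustration of a local, lightweight model: predict with plain
# Python, no SparkContext or Spark JARs. Names and layout are invented;
# this is not the actual mllib-local API.
import math

class LocalLogisticRegressionModel:
    def __init__(self, coefficients, intercept):
        self.coefficients = coefficients
        self.intercept = intercept

    def predict_probability(self, features):
        # Standard logistic regression scoring: sigmoid of the margin.
        margin = self.intercept + sum(
            c * x for c, x in zip(self.coefficients, features))
        return 1.0 / (1.0 + math.exp(-margin))

model = LocalLogisticRegressionModel([2.0, -1.0], 0.0)
print(model.predict_probability([0.0, 0.0]))
```

A class like this has no dependency heavier than the standard library, which is exactly the deployment footprint mllib-local was aiming for.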
>>>>>>>>> I agree this could be interesting, and feed into the other
>>>>>>>>> discussion around when (or if) we should be considering Spark 3.0.
>>>>>>>>> I _think_ we could probably do it with optional traits people
>>>>>>>>> could mix in to avoid breaking the current APIs, but I could be
>>>>>>>>> wrong on that point.
>>>>>>>>>
>>>>>>>>>> * It could also be worth discussing local DataFrames.  They
>>>>>>>>>> might not be as important as per-Row transformations, but they
>>>>>>>>>> would be helpful for batching for higher throughput.
>>>>>>>>>>
>>>>>>>>> That could be interesting as well.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I'll be interested to hear others' thoughts too!
>>>>>>>>>>
>>>>>>>>>> Joseph
>>>>>>>>>>
>>>>>>>>>> On Wed, May 9, 2018 at 7:18 AM, Holden Karau <
>>>>>>>>>> holden@pigscanfly.ca> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi y'all,
>>>>>>>>>>>
>>>>>>>>>>> With the renewed interest in ML in Apache Spark, now seems like
>>>>>>>>>>> as good a time as any to revisit the online serving situation in
>>>>>>>>>>> Spark ML. DB & others have done some excellent work moving a lot
>>>>>>>>>>> of the necessary tools into a local linear algebra package that
>>>>>>>>>>> doesn't depend on having a SparkContext.
>>>>>>>>>>>
>>>>>>>>>>> There are a few different commercial and non-commercial
>>>>>>>>>>> solutions around this, but currently our individual
>>>>>>>>>>> transform/predict methods are private, so they either need to
>>>>>>>>>>> copy or re-implement them (or put themselves in
>>>>>>>>>>> org.apache.spark) to access them. How would folks feel about
>>>>>>>>>>> adding a new trait for ML pipeline stages to expose
>>>>>>>>>>> transformation of single element inputs (or local collections)
>>>>>>>>>>> that could be optionally implemented by stages which support
>>>>>>>>>>> this? That way we can have less copy-and-paste code possibly
>>>>>>>>>>> getting out of sync with our model training.
>>>>>>>>>>>
>>>>>>>>>>> I think continuing to have on-line serving grow in different
>>>>>>>>>>> projects is probably the right path forward (folks have
>>>>>>>>>>> different needs), but I'd love to see us make it simpler for
>>>>>>>>>>> other projects to build reliable serving tools.
>>>>>>>>>>>
>>>>>>>>>>> I realize this maybe puts some of the folks in an awkward
>>>>>>>>>>> position with their own commercial offerings, but hopefully if
>>>>>>>>>>> we make it easier for everyone, the commercial vendors can
>>>>>>>>>>> benefit as well.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>>
>>>>>>>>>>> Holden :)
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Joseph Bradley
>>>>>>>>>> Software Engineer - Machine Learning
>>>>>>>>>> Databricks, Inc.
>>>>>>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Joseph Bradley
>>>>>>>> Software Engineer - Machine Learning
>>>>>>>> Databricks, Inc.
>>>>>>>> [image: http://databricks.com] <http://databricks.com/>
>>>>>>>>
>>>>>>> --
>>>>>>> --
>>>>>>> Cheers,
>>>>>>> Leif
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
