spark-dev mailing list archives

From Michał Zieliński <zielinski.mich...@gmail.com>
Subject Re: Any plans to migrate Transformer API to Spark SQL (closer to DataFrames)?
Date Mon, 28 Mar 2016 07:49:45 GMT
Hi Maciej,

Absolutely. We had to copy HasInputCol(s) and HasOutputCol(s), along with a
couple of others such as HasProbabilityCol, into our own repo. For most
use cases that is good enough, but for some (e.g. operating on any
Transformer that accepts either our HasInputCol or Spark's) it makes the
code clunky. Opening those traits to the public would be a big gain.
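
To make the workaround concrete, here is a minimal sketch of what such a
copy can look like. The names MyHasInputCol, MyHasOutputCol, UpperCaser and
ColumnParams are made up for illustration and are not our actual code. It
assumes the Spark 2.x+ signature where transform takes a Dataset[_] (in 1.6
it takes a DataFrame). The name-based lookup at the end is one possible way
to handle both our trait and Spark's private one, at the cost of type safety.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap, Params}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, upper}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Local copies of the shared-param traits (hypothetical names, chosen so
// they do not clash with Spark's own private[ml] HasInputCol/HasOutputCol).
trait MyHasInputCol extends Params {
  final val inputCol: Param[String] =
    new Param[String](this, "inputCol", "input column name")
  final def getInputCol: String = $(inputCol)
}

trait MyHasOutputCol extends Params {
  final val outputCol: Param[String] =
    new Param[String](this, "outputCol", "output column name")
  final def getOutputCol: String = $(outputCol)
}

// Hypothetical example Transformer: upper-cases a string column.
class UpperCaser(override val uid: String)
    extends Transformer with MyHasInputCol with MyHasOutputCol {

  def this() = this(Identifiable.randomUID("upperCaser"))

  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)

  // Assumes Spark 2.x+; on 1.6 the parameter type would be DataFrame.
  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn($(outputCol), upper(col($(inputCol))))

  // Appends the output column to the incoming schema.
  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields :+ StructField($(outputCol), StringType, nullable = true))

  override def copy(extra: ParamMap): UpperCaser = defaultCopy(extra)
}

// The clunky part: with no shared public trait, generic code has to look
// the param up by name to cover both our stages and Spark's built-in ones.
object ColumnParams {
  def inputColOf(stage: Params): Option[String] =
    if (stage.hasParam("inputCol")) {
      val p = stage.getParam("inputCol")
      if (stage.isDefined(p)) Some(stage.getOrDefault(p).toString) else None
    } else None
}
```

With this in place, ColumnParams.inputColOf works the same way on
new UpperCaser().setInputCol("name") and on a built-in stage such as
Tokenizer, but only because both happen to name the param "inputCol".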

Thanks,
Michal

On 28 March 2016 at 07:44, Jacek Laskowski <jacek@japila.pl> wrote:

> Hi,
>
> I've never developed a custom Transformer (or UnaryTransformer in
> particular), but I'd be for it if that's the case.
>
> Jacek
> 28.03.2016 6:54 AM "Maciej Szymkiewicz" <mszymkiewicz@gmail.com>
> wrote:
>
>> Hi Jacek,
>>
>> In this context, don't you think it would be useful if at least some
>> traits from org.apache.spark.ml.param.shared.sharedParams were
>> public? HasInputCol(s) and HasOutputCol, for example. These are useful
>> pretty much every time you create a custom Transformer.
>>
>> --
>> Regards,
>> Maciej Szymkiewicz
>>
>>
>> On 03/26/2016 10:26 AM, Jacek Laskowski wrote:
>> > Hi Joseph,
>> >
>> > Thanks for the response. I'm one of those who don't understand all the
>> > hype/need for Machine Learning...yet, and I'm looking at the ML space
>> > through Spark ML(lib) glasses. In the meantime I've had a few assignments
>> > (in a project with Spark and Scala) that required quite extensive
>> > dataset manipulation.
>> >
>> > That was when I sank into using DataFrame/Dataset rather than RDD for
>> > data manipulation (I remember talking to Brian about how RDD is an
>> > "assembly" language compared to the higher-level concept of
>> > DataFrames with Catalyst and other optimizations). After a few days
>> > with DataFrame I learnt he was so right! (sorry Brian, it took me
>> > longer to understand your point).
>> >
>> > I started using DataFrames in far more places than one could ever
>> > accept :-) I was so...carried away with DataFrames (esp. show vs
>> > foreach(println) and UDFs via the udf() function).
>> >
>> > And then I moved to the Pipeline API and discovered Transformers,
>> > and PipelineStage, which can create pipelines of DataFrame manipulation.
>> > They read so well that I'm pretty sure people would love using them
>> > more often, but...they belong to MLlib, so they are part of the ML space
>> > (which not many devs have tackled yet). I applied the approach of using
>> > withColumn to get a better debugging experience (if I ever need it). I
>> > learnt it after watching your presentation about the Pipeline API.
>> > It was so helpful in my RDD/DataFrame space.
>> >
>> > So, to promote more extensive use of Pipelines, PipelineStages, and
>> > Transformers, I was thinking about moving that part to the SQL/DataFrame
>> > API where they really belong. If not, I think people might miss the
>> > beauty of the very fine and so helpful Transformers.
>> >
>> > Transformers are *not* an ML thing -- they are a DataFrame thing and
>> > should be where they really belong (for their greater adoption).
>> >
>> > What do you think?
>> >
>> >
>> > Regards,
>> > Jacek Laskowski
>> > ----
>> > https://medium.com/@jaceklaskowski/
>> > Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> > Follow me at https://twitter.com/jaceklaskowski
>> >
>> >
>> > On Sat, Mar 26, 2016 at 3:23 AM, Joseph Bradley <joseph@databricks.com> wrote:
>> >> There have been some comments about using Pipelines outside of ML, but I
>> >> have not yet seen a real need for it.  If a user does want to use Pipelines
>> >> for non-ML tasks, they still can use Transformers + PipelineModels.  Will
>> >> that work?
>> >>
>> >> On Fri, Mar 25, 2016 at 8:05 AM, Jacek Laskowski <jacek@japila.pl> wrote:
>> >>> Hi,
>> >>>
>> >>> After a few weeks with spark.ml, I have come to the conclusion that
>> >>> the Transformer concept from the Pipeline API (spark.ml/MLlib) should be
>> >>> part of DataFrame (SQL), where it fits better. Are there any plans to
>> >>> migrate the Transformer API (ML) to DataFrame (SQL)?
>> >>>
>> >>> Regards,
>> >>> Jacek Laskowski
>> >>> ----
>> >>> https://medium.com/@jaceklaskowski/
>> >>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> >>> Follow me at https://twitter.com/jaceklaskowski
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>> >>> For additional commands, e-mail: dev-help@spark.apache.org
>> >>>
>> >
>>
>>
>>