spark-dev mailing list archives

From Jacek Laskowski <>
Subject Re: Any plans to migrate Transformer API to Spark SQL (closer to DataFrames)?
Date Sat, 26 Mar 2016 09:26:39 GMT
Hi Joseph,

Thanks for the response. I'm one of those who don't yet understand
all the hype around (or need for) Machine Learning, and I'm looking at
the ML space through Spark ML(lib) glasses. In the meantime I've had a
few assignments (in a project with Spark and Scala) that required
quite extensive dataset manipulation.

That was when I sank into using DataFrame/Dataset rather than RDD for
data manipulation (I remember talking to Brian about how RDD is an
"assembly" language compared to the higher-level concept of
DataFrames with Catalyst and other optimizations). After a few days
with DataFrames I learnt he was so right! (Sorry Brian, it took me
longer to understand your point.)

I started using DataFrames in far more places than one could ever
accept :-) I got so...carried away with DataFrames (esp. show vs
foreach(println) and UDFs via the udf() function).
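
For readers who have not used these two features, here is a minimal
sketch, assuming Spark 2.x or later with spark-sql on the classpath.
The object name ShowAndUdfSketch, the helper withNameLength, and the
column names name and name_length are hypothetical, chosen only for
illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical example object; none of these names come from the thread.
object ShowAndUdfSketch {
  // Wrap a plain Scala function as a UDF usable in DataFrame expressions.
  // Note: this sketch does not guard against null input strings.
  val nameLength = udf((s: String) => s.length)

  // Add a column computed by the UDF (assumes a string column "name").
  def withNameLength(df: DataFrame): DataFrame =
    df.withColumn("name_length", nameLength(col("name")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("show-and-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    val names = Seq("jacek", "joseph").toDF("name")
    val result = withNameLength(names)

    // show() renders a formatted table with column headers.
    result.show()

    // The RDD-style alternative prints one raw Row per line.
    result.collect().foreach(println)

    spark.stop()
  }
}
```

The sketch collects before printing so the rows appear on the driver;
calling foreach(println) directly on the DataFrame runs on the
executors instead.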

And then I moved to the Pipeline API and discovered Transformers, and
PipelineStages that can be chained into pipelines of DataFrame
manipulation. They read so well that I'm pretty sure people would love
using them more often, but...they belong to MLlib, so they are part of
the ML space (which not many devs have tackled yet). I applied the
approach using withColumn to get a better debugging experience (if I
ever need it). I learnt it after watching your presentation about the
Pipeline API. It was so helpful in my RDD/DataFrame space.
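
To make the idea concrete, here is a minimal sketch of a Transformer
used purely for DataFrame manipulation, with no ML involved. It
assumes Spark 2.x or later (where transform takes a Dataset[_]; in
Spark 1.6 it took a DataFrame) and spark-mllib on the classpath. The
class UpperCaser, the object TransformerSketch, and the columns name
and name_upper are hypothetical names used only for illustration.

```scala
import org.apache.spark.ml.{Pipeline, Transformer}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, upper}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical transformer: adds an upper-cased copy of the "name" column.
class UpperCaser(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("upperCaser"))

  // The actual DataFrame manipulation is a single withColumn call.
  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn("name_upper", upper(col("name")))

  // Declare the output schema so a pipeline can validate stages up front.
  override def transformSchema(schema: StructType): StructType = {
    require(schema.fieldNames.contains("name"), "input needs a 'name' column")
    schema.add(StructField("name_upper", StringType, nullable = true))
  }

  override def copy(extra: ParamMap): UpperCaser =
    copyValues(new UpperCaser(uid), extra)
}

object TransformerSketch {
  // Run the single-stage pipeline over any DataFrame with a "name" column.
  def upperCased(df: DataFrame): DataFrame = {
    val pipeline = new Pipeline().setStages(Array(new UpperCaser()))
    // With only Transformers as stages, fit() trains nothing and just
    // wraps the stages in a PipelineModel.
    pipeline.fit(df).transform(df)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("transformer-sketch")
      .getOrCreate()
    import spark.implicits._

    upperCased(Seq("jacek", "joseph").toDF("name")).show()
    spark.stop()
  }
}
```

Each stage adds its own named column, so intermediate results stay
visible in the output DataFrame, which is where the debugging benefit
of the withColumn approach comes from.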

So, to promote more extensive use of Pipelines, PipelineStages, and
Transformers, I was thinking about moving that part to the
SQL/DataFrame API, where they really belong. If not, I think people
might miss the beauty of the very fine and so helpful Transformers.

Transformers are *not* an ML thing -- they are a DataFrame thing and
should be where they really belong (for their greater adoption).

What do you think?

Jacek Laskowski
Mastering Apache Spark

On Sat, Mar 26, 2016 at 3:23 AM, Joseph Bradley <> wrote:
> There have been some comments about using Pipelines outside of ML, but I
> have not yet seen a real need for it.  If a user does want to use Pipelines
> for non-ML tasks, they still can use Transformers + PipelineModels.  Will
> that work?
> On Fri, Mar 25, 2016 at 8:05 AM, Jacek Laskowski <> wrote:
>> Hi,
>> After a few weeks with now, I came to the conclusion that the
>> Transformer concept from Pipeline API ( should be part
>> of DataFrame (SQL), where it fits better. Are there any plans to
>> migrate the Transformer API (ML) to DataFrame (SQL)?
>> Regards,
>> Jacek Laskowski
>> ----
>> Mastering Apache Spark
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
