spark-dev mailing list archives

From andy petrella <>
Subject [re-cont] map and flatMap
Date Wed, 12 Mar 2014 14:06:56 GMT

I just want to point something out...
I haven't had time yet to sort it out and think it through enough to give a
valuable, rigorous explanation, even though intuitively I feel there is a lot
to it ===> I need Spark people or time to move forward.
But here is the thing regarding *flatMap*.

Actually, it looks like (and, again, it intuitively makes sense) RDD (and
of course DStream) isn't monadic, and this is reflected in the implementation
(and signature) of flatMap.

> def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] =
>   new FlatMappedRDD(this, sc.clean(f))

There!? To be considered as such, flatMap (or bind, >>=) should take a
function that returns the same higher-level abstraction, right?

In this case, it takes a function that returns a TraversableOnce, which is
the type of the RDD's content, so what the output represents is more
the content of the RDD than the RDD itself (still right?).
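
A minimal sketch of that difference, using plain Scala collections. MiniRDD is
a hypothetical toy class, not Spark's RDD, and Iterable stands in for
TraversableOnce so the sketch is not tied to one Scala version.

```scala
// Toy stand-in for an RDD (hypothetical, NOT Spark's class): just partitions of elements.
final case class MiniRDD[T](partitions: List[List[T]]) {
  // Functor-style map: transforms each element, keeps the partition layout.
  def map[U](f: T => U): MiniRDD[U] =
    MiniRDD(partitions.map(_.map(f)))

  // Spark-style flatMap: f returns a plain collection of elements, not a MiniRDD.
  def flatMap[U](f: T => Iterable[U]): MiniRDD[U] =
    MiniRDD(partitions.map(_.flatMap(f)))

  // Gathers all elements into one local list.
  def collect: List[T] = partitions.flatten
}

object SignatureDemo {
  // Monadic bind on List: the function returns the same abstraction (a List).
  val monadic: List[Int] = List(1, 2, 3).flatMap(x => List(x, x * 10))

  // Spark-style: the function returns element collections; it never builds a MiniRDD.
  val sparkStyle: List[Int] =
    MiniRDD(List(List(1, 2), List(3))).flatMap(x => List(x, x * 10)).collect
}
```

Both calls flatten one level, but only the List version takes a function of
the shape A => F[B] with F being the container itself.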

This actually breaks the usual understanding of map and flatMap:

> def map[U: ClassTag](f: T => U): RDD[U] =
>   new MappedRDD(this, sc.clean(f))

Indeed, RDD is a functor, and the underlying reason for flatMap not to take
A => RDD[B] doesn't show up in map.

This has a lot of consequences actually, because at first one might want to
write for-comprehensions over RDDs, or even Traversable[F[_]] functions
like sequence -- and one will get stuck since the signatures aren't compliant.
More importantly, Scala relies on a convention about the structure of a type
to allow for-comprehensions... so where Traversable[F[_]] will fail on types,
a for-comprehension will fail weirdly.
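
To make the for-comprehension point concrete, here is a small sketch on
standard collections only. The object name ForCompDemo and the sequence helper
are illustrative names, not part of Spark.

```scala
object ForCompDemo {
  // With List, the inner generator yields a List again, so this compiles.
  val pairs: List[(Int, Char)] =
    for {
      n <- List(1, 2)
      c <- List('a', 'b')
    } yield (n, c)

  // What the compiler rewrites the block above into: flatMap outside, map inside.
  val desugared: List[(Int, Char)] =
    List(1, 2).flatMap(n => List('a', 'b').map(c => (n, c)))

  // sequence for one concrete monad (Option): it needs flatMap to accept A => Option[B].
  def sequence[A](xs: List[Option[A]]): Option[List[A]] =
    xs.foldRight(Option(List.empty[A])) { (oa, acc) =>
      for { a <- oa; rest <- acc } yield a :: rest
    }
}
```

With the flatMap signature quoted earlier, the same two-generator shape over
two RDDs would desugar to rddA.flatMap(a => rddB.map(...)), where the inner
expression is an RDD rather than a TraversableOnce, so the compiler reports a
type mismatch on the desugared call instead of on the for-comprehension.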

Again, this signature sounds normal, because my intuitive feeling about RDDs
is that they *can* be monadic, but only in a way where the composition would
depend on the use case and might have heavy consequences (unioning the RDDs,
for instance => this happening behind the scenes can be a big pain, since it
wouldn't be efficient at all).

So yes, RDD could be monadic, but with care.
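
A rough sketch of what a union-based bind could look like, and why it is
costly. ToyRDD and bind are hypothetical names for illustration; nothing here
is Spark's implementation.

```scala
// Hypothetical toy container: a list of partitions, each a list of elements.
final case class ToyRDD[T](partitions: List[List[T]]) {
  // Union simply concatenates the partition lists.
  def union(other: ToyRDD[T]): ToyRDD[T] = ToyRDD(partitions ++ other.partitions)

  // Monadic bind: f builds a whole ToyRDD per element, then all results are unioned.
  def bind[U](f: T => ToyRDD[U]): ToyRDD[U] =
    partitions.flatten.map(f).foldLeft(ToyRDD[U](Nil))(_.union(_))

  // Gathers all elements into one local list.
  def collect: List[T] = partitions.flatten
}

object BindDemo {
  val source: ToyRDD[Int] = ToyRDD(List(List(1, 2), List(3)))

  // Three input elements produce three unioned results, one partition each.
  val bound: ToyRDD[Int] = source.bind(x => ToyRDD(List(List(x, x * 10))))
}
```

Even in this toy, the original two partitions are lost and replaced by one
partition per element, which hints at the kind of hidden cost a union-based
composition would carry.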

So what this signature exposes is a way to flatMap over the inner values,
almost as is the case for Map (flatMapValues).
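
For comparison, a sketch of the flatMapValues idea on a plain list of pairs.
The helper below is a hypothetical local function written for illustration,
not Spark's method.

```scala
object FlatMapValuesDemo {
  // Hypothetical helper: keys are kept, each value expands into zero or more values.
  def flatMapValues[K, V, U](pairs: List[(K, V)])(f: V => Iterable[U]): List[(K, U)] =
    pairs.flatMap { case (k, v) => f(v).map(u => (k, u)) }

  // "a" expands to two entries, "b" to none.
  val expanded: List[(String, String)] =
    flatMapValues(List("a" -> 2, "b" -> 0))(n => List.fill(n)("x"))
}
```

As with flatMap above, the function works on the inner values and returns a
plain collection, never the outer container.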

So, wouldn't it be better to rename flatMap to flatMapData (or some better
name)? Or to have flatMap require a Monad instance for RDD?

Sorry for the prose, just dropped my thoughts and feelings at once :-/


PS: and for my English, maybe; although my name's Andy, I'm a native Belgian ^^.
