spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Armbrust <mich...@databricks.com>
Subject Re: Dataset Outer Join vs RDD Outer Join
Date Wed, 01 Jun 2016 17:42:24 GMT
Thanks for the feedback.  I think this will address at least some of the
problems you are describing: https://github.com/apache/spark/pull/13425

On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher <rmarscher@localytics.com>
wrote:

> Hi,
>
> I've been working on transitioning from RDD to Datasets in our codebase in
> anticipation of being able to leverage features of 2.0.
>
> I'm having a lot of difficulties with the impedance mismatches between how
> outer joins worked with RDD versus Dataset. The Dataset joins feel like a
> big step backwards IMO. With RDD, leftOuterJoin would give you Option types
> of the results from the right side of the join. This follows idiomatic
> Scala avoiding nulls and was easy to work with.
>
> Now with Dataset there is only joinWith where you specify the join type,
> but it lost all the semantics of identifying missing data from outer joins.
> I can write some enriched methods on Dataset with an implicit class to
> abstract messiness away if Dataset nulled out all mismatching data from an
> outer join, however the problem goes even further in that the values aren't
> always null. Integer, for example, defaults to -1 instead of null. Now it's
> completely ambiguous what data in the join was actually there versus
> populated via this atypical semantic.
>
> Are there additional options available to work around this issue? I can
> convert to RDD and back to Dataset but that's less than ideal.
>
> Thanks,
> --
> *Richard Marscher*
> Senior Software Engineer
> Localytics
> Localytics.com <http://localytics.com/> | Our Blog
> <http://localytics.com/blog> | Twitter <http://twitter.com/localytics> |
> Facebook <http://facebook.com/localytics> | LinkedIn
> <http://www.linkedin.com/company/1148792?trk=tyah>
>

Mime
View raw message