datafu-dev mailing list archives

From "Matthew Hayes (JIRA)" <>
Subject [jira] [Commented] (DATAFU-148) Setup Spark sub-project
Date Thu, 16 Aug 2018 22:44:00 GMT


Matthew Hayes commented on DATAFU-148:

Thanks [~eyal] and [~uzadude] for submitting this and setting up the initial Spark subproject. 
This looks like a great start.  I look forward to seeing more of the Spark code you have. 
I reviewed the code and have the following comments:

In SparkDFUtils.scala:
- dedup2 could use some additional description to differentiate it from dedup.
- flatten is missing documentation
- for broadcastJoinSkewed, the description of the numberCustsToBroadcast field isn't clear
to me
- joinWithRange could use some more documentation. For example, the fields are not all documented.
It's not immediately obvious to me what DECREASE_FACTOR does and why it should have a default
value of 2^8.
- Also, joinWithRange seems characteristically different from the others in this file, as it's
a bit more use-case specific. Maybe later it would make sense to move it to a separate file.
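To make the DECREASE_FACTOR question concrete: a common way to implement a range join is to bucket both sides and equi-join on the bucket id, where a smaller bucket size produces fewer false candidate pairs but more (interval, bucket) rows. Here is a minimal plain-Scala sketch of that idea; the names and the exact scheme are hypothetical, assuming joinWithRange works roughly along these lines (non-negative values only, since integer division is used for bucketing):

```scala
// Hypothetical interval record standing in for rows on the range side.
case class Interval(id: String, lo: Int, hi: Int)

// Emit one (bucketId, interval) row for every bucket the interval spans.
def bucketsFor(iv: Interval, bucketSize: Int): Seq[(Int, Interval)] =
  (iv.lo / bucketSize to iv.hi / bucketSize).map(b => (b, iv))

// Coarse equi-join on bucket id, followed by the exact range check.
def rangeJoin(points: Seq[Int],
              intervals: Seq[Interval],
              bucketSize: Int): Seq[(Int, Interval)] = {
  val byBucket: Map[Int, Seq[(Int, Interval)]] =
    intervals.flatMap(bucketsFor(_, bucketSize)).groupBy(_._1)
  for {
    p       <- points
    (_, iv) <- byBucket.getOrElse(p / bucketSize, Seq.empty)
    if iv.lo <= p && p <= iv.hi // exact check after the coarse match
  } yield (p, iv)
}
```

If DECREASE_FACTOR divides the bucket size in a scheme like this, that would explain the trade-off a default of 2^8 is making, and would be worth spelling out in the docs.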

In build.gradle:
- The download plugin isn't needed.
- Is autojarring necessary? Looking at the contents of the datafu-spark jar, we only have
datafu.spark and org.apache.spark classes. It seems like org.apache.spark classes shouldn't
need to be included. Also the build.gradle autojars commons-math and guava, which aren't used.
It seems all this jarjar and autojar stuff could be stripped out of this file.

flatten and changeSchema should have tests, I think.
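To pin down what a flatten test should assert, here is a minimal plain-Scala sketch of the expected semantics; Person and Address are hypothetical stand-ins for a DataFrame with a struct column, assuming flatten promotes the struct's fields to top-level columns:

```scala
// Hypothetical nested record: a row with one struct column ("address").
case class Address(city: String, zip: String)
case class Person(name: String, address: Address)

// Expected behaviour to pin down in a test: nested fields become
// top-level (column name -> value) pairs alongside the flat fields.
def flattenPerson(p: Person): Map[String, String] =
  Map("name" -> p.name, "city" -> p.address.city, "zip" -> p.address.zip)
```

An actual test would of course build a small DataFrame, call flatten, and compare schemas and rows, but the assertion shape would be the same.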

A question regarding documentation: people would generally be using these via {{DataFrameOps}},
so it would probably be helpful to have doc links in those methods to the underlying implementation.
Is the reason {{SparkDFUtils}} is split out into a separate file so that it can be used in
the future by other methods?  By the way, I found out you can generate the docs with the
command below.  Before including this in a release it would be good to review the generated
docs and see where they can be improved.  For example, the packages and objects don't have
descriptions.

./gradlew :datafu-spark:scaladoc
Also, if we were to merge this in, it should probably go into a new pending release
branch like 2.0.0 so we can continue working on getting it ready independently of short-term
releases.  I think this should trigger a major version bump, since it is a new sub-project
and gives us the chance to clean up anything we've deprecated. Thoughts?

> Setup Spark sub-project
> -----------------------
>                 Key: DATAFU-148
>                 URL:
>             Project: DataFu
>          Issue Type: New Feature
>            Reporter: Eyal Allweil
>            Assignee: Eyal Allweil
>            Priority: Major
> Create a skeleton Spark sub project for Spark code to be contributed to DataFu

This message was sent by Atlassian JIRA
