arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dean Chen <d...@dv01.co>
Subject Re: Improve SparkR collect performance with Arrow
Date Mon, 15 May 2017 15:01:49 GMT
Hi Wes,

We can work with the Spark community on the Spark/SparkR integration.

Also happy to help with migrating the R package from Feather in to Arrow.

Have anyone in mind to manage the R/Rcpp binding issues? I reviewed the R
and cpp files in https://github.com/wesm/feather/tree/master/R and we may
be able to take a first pass on it to get things kicked off the ground.
Will still want an expert with Rcpp to review and own since we're not
experts with Rcpp and I'm sure it's riddled with lots of caveats like any
other fdw.

We maintaining lots of R packages internally and can help or take the lead
on R packaging/builds/testing in travis in the Arrow project.

On Sun, May 14, 2017 at 2:46 PM Wes McKinney <wesmckinn@gmail.com> wrote:

> Note I just opened https://github.com/wesm/feather/pull/297 which deletes
> all of the Feather Python code (using pyarrow as a dependency).
>
> On Sun, May 14, 2017 at 2:44 PM, Wes McKinney <wesmckinn@gmail.com> wrote:
>
> > hi Dean,
> >
> > In Arrow 0.3 we incorporated the C++ and Python code from wesm/feather
> > into the Arrow repo. The Feather format is a simplified version of the
> > Arrow IPC format (which has file/batch and stream flavors), so the ideal
> > approach would be to move the Feather R/Rcpp wrapper code into the Arrow
> > codebase and generalize it to support the Arrow streams that are coming
> > from Spark (as in SPARK-13534).
> >
> > Adding support for nested types should also be possible -- we have
> > implemented more of the converters for them on the Python side. The
> Feather
> > format doesn't support nested types, so we would want to deprecate that
> > format as soon as practical (Feather has plenty of users; and we can
> always
> > maintain the library(feather) import and associated R API).
> >
> > In any case, this seems like an ideal collaboration for the Spark and
> > Arrow communities; what is missing is an experienced developer from the R
> > community who can manage the R/Rcpp binding issues (I can help some with
> > maintaining the C++ side of the bindings) and address packaging / builds
> /
> > continuous integration.
> >
> > - Wes
> >
> > On Sun, May 14, 2017 at 1:26 PM, Dean Chen <dean@dv01.co> wrote:
> >
> >> Following up on the discussion from
> >> https://issues.apache.org/jira/browse/SPARK-18924. We have internal use
> >> cases that would benefit significantly from improved collect performance
> >> and would like to kick off a similar proposal/effort to
> >> https://issues.apache.org/jira/browse/SPARK-13534 for SparkR.
> >>
> >> Complex datatypes introduced additional complexity to 13534 and it's
> not a
> >> requirement for us so thinking the initial proposal would be for simple
> >> types with fall back on the current implementation for complex types.
> >>
> >> Integration would involve introducing a flag to enable the arrow
> >> serialization logic *collect*(
> >>
> https://github.com/apache/spark/blob/branch-2.2/R/pkg/R/DataFrame.R#L1129
> >> )
> >> that would call an Arrow implementation of *dfToCols*
> >> https://github.com/apache/spark/blob/branch-2.2/sql/core/
> >> src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L211
> >> that
> >> returns Arrow byte arrays.
> >>
> >> Looks like https://github.com/wesm/feather hasn't been updated since
> the
> >> Arrow 0.3 release so assuming it would have to be updated to enable
> >> converting the byte array from dfToCols to R dataframes? Wes also
> brought
> >> up that unified serialization implementation for Spark/Scala, R and
> Python
> >> to enable easy sharing of IO optimizations.
> >>
> >> Please let us know your thoughts/opinions on the above and the preferred
> >> way of collaborating with the Arrow community on this.
> >> --
> >> VP of Engineering - dv01, Featured in Forbes Fintech 50 For 2016
> >> <http://www.forbes.com/fintech/2016/#310668d56680>
> >> 915 Broadway | Suite 502 | New York, NY 10010
> >> (646)-838-2310 <(646)%20838-2310>
> >> dean@dv01.co | www.dv01.co
> >>
> >
> >
>
-- 
VP of Engineering - dv01, Featured in Forbes Fintech 50 For 2016
<http://www.forbes.com/fintech/2016/#310668d56680>
915 Broadway | Suite 502 | New York, NY 10010
(646)-838-2310
dean@dv01.co | www.dv01.co

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message