arrow-dev mailing list archives

From Dean Chen <d...@dv01.co>
Subject Re: Use case for R Arrow Bindings
Date Thu, 20 Jul 2017 00:01:51 GMT
Sounds good, will get a thread going there.

On Wed, Jul 19, 2017 at 6:02 PM Wes McKinney <wesmckinn@gmail.com> wrote:

> Especially with Arrow support landing in Spark (SPARK-13534), it would
> be helpful to combine efforts between Python and R on this front. I
> also have a long list of improvements to the Feather format that will
> be substantially simpler once library(feather) is depending on the
> main Arrow libraries.
>
> I suggest you reach out to members of the R community directly on
> public forums about development help / advice and soliciting
> collaboration. There are other R venues where you can describe your
> use cases, like the R Consortium and its subcommittees:
> https://www.r-consortium.org/. I would go directly to the mailing
> lists and see if there is anyone who would like to get involved. It's
> more likely that you'll get attention on this problem in the R mailing
> lists than on the Arrow mailing list due to the chicken-and-egg
> aspect.
>
> As a side note, my opinion is that shared storage, memory formats, and
> computing libraries (e.g. native C++ libraries targeting Arrow memory)
> are going to be more and more important to the R / Python / Julia
> communities (and beyond -- Kou has been developing Arrow interfaces
> for Ruby, which has not traditionally had a large data science
> community) as time passes. I would like to personally do more on the R
> side but I simply don't have the bandwidth to take responsibility for
> another major component, especially not in an unfamiliar software
> development stack.
>
> Let me know how I can help, and if there are R mailing list
> discussions where we (the Arrow developers) can chime in please alert
> us to them here.
>
> - Wes
>
> On Wed, Jul 19, 2017 at 5:29 PM, Dean Chen <dean@dv01.co> wrote:
> > I also sent a note about it to the dev list a month ago. Still have a huge
> > internal need and interested in helping push this along where we can.
> > Unfortunately, our team is more focused around Spark and doesn't have much
> > experience working with the R community.
> >
> > On Wed, Jul 19, 2017 at 1:44 PM Clark Fitzgerald <clarkfitzg@gmail.com>
> > wrote:
> >
> >> Hello all,
> >>
> >> I saw the notes come through from today's call:
> >>
> >> > * R Arrow Bindings?
> >> >  - Find use cases within the R community, contributors needed
> >> >  - R Feather bindings a useful starting point
> >>
> >> This year I've been working on parallel R on datasets in the 100+ GB range,
> >> and have found that loading and saving data from text files is a real
> >> bottleneck. Another consideration is breaking the data up into chunks for
> >> parallel processing while maintaining metadata and overall structure. So
> >> I've been watching Parquet and Arrow.
> >>
> >> Specifically here are two use cases in R where Arrow / Parquet could be
> >> helpful:
> >>
> >> - Splitting up a large data set into pieces which fit comfortably in memory
> >> then applying normal R functions to each piece. Basically GROUP BY.
> >> - Matloff's Software Alchemy, statistical averaging based on independent
> >> chunks of data. This requires rows to be randomly assigned to chunks.
> >>
> >> Another option besides starting from the R Feather bindings is to start
> >> with an automatically generated set of bindings:
> >> https://github.com/duncantl/RCodeGen
> >>
> >> Best,
> >> Clark Fitzgerald
> >>
> > --
> > VP of Engineering - dv01, Featured in Forbes Fintech 50 For 2016
> > <http://www.forbes.com/fintech/2016/#310668d56680>
> > 915 Broadway | Suite 502 | New York, NY 10010
> > (646)-838-2310
> > dean@dv01.co | www.dv01.co
>
-- 
VP of Engineering - dv01, Featured in Forbes Fintech 50 For 2016
<http://www.forbes.com/fintech/2016/#310668d56680>
915 Broadway | Suite 502 | New York, NY 10010
(646)-838-2310
dean@dv01.co | www.dv01.co
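
The Software Alchemy use case quoted above (randomly assign rows to chunks, compute on each chunk independently, then average the results) can be sketched in a few lines. This is only a minimal illustration in plain Python rather than R or Arrow; the names `random_chunks` and `chunk_average` are hypothetical and not part of any library discussed in this thread, and the in-memory list stands in for data that would really be read chunk by chunk from storage.

```python
import random
from statistics import mean


def random_chunks(rows, n_chunks, seed=0):
    """Randomly assign each row to one of n_chunks chunks of near-equal size."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    # Striding over the shuffled rows keeps chunk sizes within one row of each other.
    return [shuffled[i::n_chunks] for i in range(n_chunks)]


def chunk_average(rows, estimator, n_chunks, seed=0):
    """Apply `estimator` to each random chunk and average the per-chunk results."""
    chunks = random_chunks(rows, n_chunks, seed)
    # Each estimator call is independent, so the calls could run in parallel.
    return mean(estimator(chunk) for chunk in chunks)


# Toy data standing in for a large table (assumption: a single numeric column).
data = [float(x) for x in range(1, 101)]
chunks = random_chunks(data, n_chunks=4)
estimate = chunk_average(data, mean, n_chunks=4)
```

Because rows are assigned at random, each chunk behaves like an independent sample of the full data set, which is what makes averaging the per-chunk estimates reasonable. The GROUP BY use case differs only in how rows are assigned: by key rather than at random.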
