arrow-dev mailing list archives

From Clark Fitzgerald <clarkfi...@gmail.com>
Subject Re: Use case for R Arrow Bindings
Date Tue, 25 Jul 2017 04:44:53 GMT
Great, I'll be on the call. The first steps I took today with the
automatically generated bindings from the C++ source seem promising. Much
more work is required to make it usable though.

On Mon, Jul 24, 2017 at 9:00 PM, Kevin Moore <kevin@quiltdata.io> wrote:

> A group of Quilt users and team members interested in R is planning a short
> call to get the ball rolling on R bindings for Arrow (and Quilt) tomorrow
> at 4PM Pacific. We'd love to have anyone who's interested from this list
> join us in the hangout:
> https://hangouts.google.com/hangouts/_/quiltdata.io/aneesh?authuser=1
>
> Thanks,
>
> Kevin
>
> ----
> Kevin Moore
> CEO, Quilt Data, Inc.
> kevin@quiltdata.io | LinkedIn <https://www.linkedin.com/in/kevinemoore/>
> (415) 497-7895
>
>
> Manage Data like Code
> quiltdata.com
>
> On Mon, Jul 24, 2017 at 7:58 AM, Wes McKinney <wesmckinn@gmail.com> wrote:
>
> > + Hadley
> >
> > On Fri, Jul 21, 2017 at 2:04 PM, Bryan Cutler <cutlerb@gmail.com> wrote:
> > > Thanks Clark.  I know that SparkR would benefit a lot from Arrow bindings
> > > and many people would like to see that, but to my knowledge no one has
> > > started working on this yet.  Please keep us updated with what you find!
> > >
> > > Bryan
> > >
> > > On Fri, Jul 21, 2017 at 9:15 AM, Clark Fitzgerald <clarkfitzg@gmail.com>
> > > wrote:
> > >
> > >> Regarding the R Consortium, the Distributed Computing Working Group led by
> > >> Michael Lawrence would be interested in this. It would be nice to go to
> > >> them with some working examples and use cases.
> > >>
> > >> Next week I will start looking into R / Arrow bindings. A couple other
> > >> people at the UC Davis Data Science Initiative have expressed interest as
> > >> well. I'll post updates here.
> > >>
> > >> On Wed, Jul 19, 2017 at 5:01 PM, Dean Chen <dean@dv01.co> wrote:
> > >>
> > >> > Sounds good, will get a thread going there.
> > >> >
> > >> > On Wed, Jul 19, 2017 at 6:02 PM Wes McKinney <wesmckinn@gmail.com> wrote:
> > >> >
> > >> > > Especially with Arrow support landing in Spark (SPARK-13534), it would
> > >> > > be helpful to combine efforts between Python and R on this front. I
> > >> > > also have a long list of improvements to the Feather format that will
> > >> > > be substantially simpler once library(feather) is depending on the
> > >> > > main Arrow libraries.
> > >> > >
> > >> > > I suggest you reach out to members of the R community directly on
> > >> > > public forums about development help / advice and soliciting
> > >> > > collaboration. There are other R venues where you can describe your
> > >> > > use cases, like the R Consortium and its subcommittees:
> > >> > > https://www.r-consortium.org/. I would go directly to the mailing
> > >> > > lists and see if there is anyone who would like to get involved. It's
> > >> > > more likely that you'll get attention on this problem in the R mailing
> > >> > > lists than on the Arrow mailing list due to the chicken-and-egg
> > >> > > aspect.
> > >> > >
> > >> > > As a side note, my opinion is that shared storage, memory formats, and
> > >> > > computing libraries (e.g. native C++ libraries targeting Arrow memory)
> > >> > > are going to be more and more important to the R / Python / Julia
> > >> > > communities (and beyond -- Kou has been developing Arrow interfaces
> > >> > > for Ruby, which has not traditionally had a large data science
> > >> > > community) as time passes. I would like to personally do more on the R
> > >> > > side but I simply don't have the bandwidth to take responsibility for
> > >> > > another major component, especially not in an unfamiliar software
> > >> > > development stack.
> > >> > >
> > >> > > Let me know how I can help, and if there are R mailing list
> > >> > > discussions where we (the Arrow developers) can chime in please alert
> > >> > > us to them here.
> > >> > >
> > >> > > - Wes
> > >> > >
> > >> > > On Wed, Jul 19, 2017 at 5:29 PM, Dean Chen <dean@dv01.co> wrote:
> > >> > > > I also sent a note about it to the dev list a month ago. Still have a
> > >> > > > huge internal need and interested in helping push this along where we
> > >> > > > can. Unfortunately, our team is more focused around Spark and doesn't
> > >> > > > have much experience working with the R community.
> > >> > > >
> > >> > > > On Wed, Jul 19, 2017 at 1:44 PM Clark Fitzgerald <clarkfitzg@gmail.com>
> > >> > > > wrote:
> > >> > > >
> > >> > > >> Hello all,
> > >> > > >>
> > >> > > >> I saw the notes come through from today's call:
> > >> > > >>
> > >> > > >> > * R Arrow Bindings?
> > >> > > >> >  - Find use cases within the R community, contributors needed
> > >> > > >> >  - R Feather bindings a useful starting point
> > >> > > >>
> > >> > > >> This year I've been working on parallel R on datasets in the 100+ GB
> > >> > > >> range, and have found that loading and saving data from text files is a
> > >> > > >> real bottleneck. Another consideration is breaking the data up into
> > >> > > >> chunks for parallel processing while maintaining metadata and overall
> > >> > > >> structure. So I've been watching Parquet and Arrow.
> > >> > > >>
> > >> > > >> Specifically here are two use cases in R where Arrow / Parquet could be
> > >> > > >> helpful:
> > >> > > >>
> > >> > > >> - Splitting up a large data set into pieces which fit comfortably in
> > >> > > >> memory then applying normal R functions to each piece. Basically GROUP BY.
> > >> > > >> - Matloff's Software Alchemy, statistical averaging based on independent
> > >> > > >> chunks of data. This requires rows to be randomly assigned to chunks.
> > >> > > >>
> > >> > > >> Another option besides starting from the R Feather bindings is to start
> > >> > > >> with an automatically generated set of bindings:
> > >> > > >> https://github.com/duncantl/RCodeGen
> > >> > > >>
> > >> > > >> Best,
> > >> > > >> Clark Fitzgerald
> > >> > > >>
> > >> > > > --
> > >> > > > VP of Engineering - dv01, Featured in Forbes Fintech 50 For 2016
> > >> > > > <http://www.forbes.com/fintech/2016/#310668d56680>
> > >> > > > 915 Broadway | Suite 502 | New York, NY 10010
> > >> > > > (646)-838-2310
> > >> > > > dean@dv01.co | www.dv01.co
> > >> > >
> > >> > --
> > >> > VP of Engineering - dv01, Featured in Forbes Fintech 50 For 2016
> > >> > <http://www.forbes.com/fintech/2016/#310668d56680>
> > >> > 915 Broadway | Suite 502 | New York, NY 10010
> > >> > (646)-838-2310
> > >> > dean@dv01.co | www.dv01.co
> > >> >
> > >>
> >
>
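
A minimal Python sketch of the two use cases listed in the original message of this thread: splitting a data set into pieces and applying a function to each piece (GROUP BY), and averaging an estimator over randomly assigned chunks, as in Software Alchemy. It is illustrative only and makes several assumptions: it uses no Arrow or Parquet API, Python stands in for R, rows are held in plain in-memory lists, and the helper names `split_apply`, `random_chunks`, and `chunk_averaged` are hypothetical.

```python
import random
import statistics
from collections import defaultdict


def split_apply(rows, key, func):
    """Group rows by key(row), then apply func to each group (GROUP BY)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return {k: func(group) for k, group in groups.items()}


def random_chunks(rows, n_chunks, seed=0):
    """Randomly assign rows to n_chunks chunks of near-equal size."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Striding over the shuffled rows gives each chunk a random subset.
    return [shuffled[i::n_chunks] for i in range(n_chunks)]


def chunk_averaged(rows, estimator, n_chunks, seed=0):
    """Apply estimator to each random chunk and average the results."""
    chunks = random_chunks(rows, n_chunks, seed)
    return statistics.fmean([estimator(chunk) for chunk in chunks])
```

For example, `chunk_averaged(values, statistics.fmean, n_chunks=10)` averages ten per-chunk means. In a parallel setting each chunk would presumably be handled by a separate worker, which is where a shared on-disk or in-memory format could help.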
