arrow-user mailing list archives

From Chris Nuernberger <ch...@techascent.com>
Subject Re: Java dataframe library for arrow suggestions
Date Tue, 16 Mar 2021 23:58:56 GMT
There is a JVM based dataframe library:
https://github.com/techascent/tech.ml.dataset

There are dplyr-like bindings for it: https://github.com/scicloj/tablecloth

It supports mmap/in-place loading of Arrow files (which the Java SDK does
not): https://techascent.com/blog/memory-mapping-arrow.html

And it performs just fine whether you use parquet or arrow:
https://github.com/zero-one-group/geni-performance-benchmark

It also supports graal native compilation so you can have a graal native
executable that reads/writes/mmaps arrow data.
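
For readers unfamiliar with what mmap/in-place loading means on the JVM, here
is a minimal sketch using only java.nio. It is not tech.ml.dataset's
implementation and it does not parse the Arrow IPC format: the class name
MmapSketch, the helper names, and the raw file of little-endian int32 values
are all assumptions made for illustration. The point is only that values are
read straight from the mapped region instead of being copied onto the heap
first.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapSketch {
    /** Maps the whole file read-only; the OS pages bytes in lazily on access. */
    public static MappedByteBuffer mapReadOnly(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            // Arrow buffers are little-endian by default, so read in that order.
            buf.order(ByteOrder.LITTLE_ENDIAN);
            return buf;
        }
    }

    /** Sums n int32 values starting at byteOffset, reading them in place. */
    public static long sumInts(ByteBuffer buf, int byteOffset, int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += buf.getInt(byteOffset + 4 * i); // absolute read, no copy
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical data file: four little-endian int32 values, not Arrow IPC.
        Path tmp = Files.createTempFile("mmap-sketch", ".bin");
        tmp.toFile().deleteOnExit();
        ByteBuffer src = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
        for (int v : new int[] {1, 2, 3, 4}) {
            src.putInt(v);
        }
        Files.write(tmp, src.array());

        MappedByteBuffer mapped = mapReadOnly(tmp);
        System.out.println(sumInts(mapped, 0, 4)); // prints 10
    }
}
```

A real Arrow file additionally needs its footer and record batch metadata
decoded to find where each column's buffers start; the blog post linked above
covers that part.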

On Tue, Mar 16, 2021 at 5:52 PM Andy Grove <andygrove73@gmail.com> wrote:

> This isn't directly related to the question, but I was reading about the
> newly released JDK 16 today and there is initial support for explicit
> vectorized operations, which might be interesting to explore for anyone
> considering building a Java DataFrame implementation.
>
> https://openjdk.java.net/jeps/338
>
> On Tue, Mar 16, 2021 at 5:43 PM Andrew Melo <andrew.melo@gmail.com> wrote:
>
>> I can't speak to how complete it is, but I looked earlier for
>> something similar and ran across
>> https://github.com/deeplearning4j/nd4j .. it's probably not an exact
>> fit, but it does appear to be able to consume arrow buffers and expose
>> them to java.
>>
>> Cheers
>> Andrew
>>
>> On Tue, Mar 16, 2021 at 6:36 PM Wes McKinney <wesmckinn@gmail.com> wrote:
>> >
>> > This has been asked several times in the past but I'm not aware of
>> > anything "dataframe-like" in Java that's built against Arrow (or
>> > otherwise) that fills the kind of need that pandas does. There was a
>> > Scala project some years ago Saddle [1] (not Arrow-based) built
>> > initially by one of the early pandas developers but I don't think it's
>> > still being actively developed. To build a higher-level Java API on
>> > top of the Arrow Java libraries would be incredibly useful to the
>> > community I'm sure.
>> >
>> > [1]: https://github.com/saddle/saddle
>> >
>> > On Tue, Mar 16, 2021 at 5:06 PM Paul Whalen <pgwhalen@gmail.com> wrote:
>> > >
>> > > Hi,
>> > >
>> > > I've been using Arrow for some time now, mostly in the context of
>> > > Arrow Flight between Java and Python.  While it's quite easy to convert
>> > > Arrow data in Python to a pandas dataframe and manipulate it, I'm
>> > > struggling to find an obvious analogue on the Java side.  VectorSchemaRoot
>> > > is useful for loading/unloading/moving data, but clumsy for doing
>> > > higher-level operations, especially joins/aggregations/etc across "tables".
>> > >
>> > > In other words, if I wanted to load non-Arrow-formatted data from
>> > > somewhere into Java, manipulate it with a dataframe-like API, and then send
>> > > the result somewhere via Flight, what library would be the best/simplest
>> > > way to accomplish that?  I see lots of progress in other languages, but I'm
>> > > wondering what would be recommended for Java.
>> > >
>> > > I'm currently looking at Spark SQL just in-application, but that
>> > > seems a touch heavyweight, and I'm not sure it would do exactly what I've
>> > > described (nor am I terribly familiar with Spark in the first place).
>> > >
>> > > If the premise of this question is flawed, please feel free to
>> > > correct me.
>> > >
>> > > Thanks!
>> > > Paul
>>
>
