There was some earlier effort in Arrow to start building this out (see the algorithms package), but we tabled it because of the large scope and the limited availability of maintainers.
This has been asked several times in the past, but I'm not aware of
anything "dataframe-like" in Java that's built against Arrow (or
otherwise) that fills the kind of need that pandas does. There was a
Scala project some years ago, Saddle (not Arrow-based), built
initially by one of the early pandas developers, but I don't think it's
still being actively developed. A higher-level Java API built on
top of the Arrow Java libraries would be incredibly useful to the
community, I'm sure.
On Tue, Mar 16, 2021 at 5:06 PM Paul Whalen <email@example.com> wrote:
> I've been using Arrow for some time now, mostly in the context of Arrow Flight between Java and Python. While it's quite easy to convert Arrow data in Python to a pandas dataframe and manipulate it, I'm struggling to find an obvious analogue on the Java side. VectorSchemaRoot is useful for loading/unloading/moving data, but clumsy for doing higher-level operations, especially joins/aggregations/etc. across "tables".
> In other words, if I wanted to load non-Arrow-formatted data from somewhere into Java, manipulate it with a dataframe-like API, and then send the result somewhere via Flight, what library would be the best/simplest way to accomplish that? I see lots of progress in other languages, but I'm wondering what would be recommended for Java.
> I'm currently looking at Spark SQL just in-application, but that seems a touch heavyweight, and I'm not sure it would do exactly what I've described (nor am I terribly familiar with Spark in the first place).
> If the premise of this question is flawed, please feel free to correct me.