arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jacques Nadeau <>
Subject Re: [DISCUSS][JAVA]Support Fast/Unsafe Vector APIs for Arrow
Date Sun, 05 May 2019 10:01:24 GMT
> 1. How much slower is the current Arrow API, compared to directly accessing
> off-heap memory?
> According to my (intuitive) experience in vectorizing Flink, the current
> API is much slower, at least one or two orders of magnitude slower.
> I am sorry I do not have the exact number. However, the conclusion can be
> expected to hold true: Parth's experience on Drill also confirms the
> conclusion.
> In fact, we are working on it. ARROW-5209 is about introducing performance
> benchmarks and once that is done, the number will be clear.

Are you comparing to a situation where you can crash the JVM versus one
where you cannot? Let's make sure we're comparing apples to apples.

> 2. Why is current Arrow APIs so slow?
> I think the main reason is too many function calls. I believe each function
> call is highly optimized and only carries out simple work. However, the
> number of calls is large.
> The example in our online doc gives a simple example: a single call to
> Float8Vector.get method (which is an API fundamental enough) involves
> nearly 30 method calls. That is just too much overhead, especially for
> performance-critical scenarios, like SQL engines.

Are they? Who is asking that? I haven't heard that feedback at all and we
use the Arrow APIs extensively in Dremio and compete very well with other
SQL engines. The APIs were designed with the perspective that they need to
protect themselves in the context of the JVM so that a random user doesn't
hurt themselves. It sounds like maybe you don't agree with that.

It would be good for you to outline the 30 methods you see as being called
in FloatVector.get method.

In general, I think we should be more focused on the compiled code once it
has been optimized, not the methods. Have you looked at the assembly for
this method that the JIT outputs? The get method should collapse to a very
small number of instructions. If it isn't, we should address that. Have you
done that analysis? Has disabling the bounds checking addressed the issue
for you?

> 3. Can we live without Arrow, and just directly access the off-heap memory
> (e.g. by the UNSAFE instance)?
> I guess the answer is absolutely, yes.
> Parth is doing this (bypassing Arrow API) with Drill, and this is exactly
> what we are doing with Flink. My point is that, providing light-weight APIs
> will make it easier to use Arrow. Without such APIs, Parth may need to
> provide a library of Arrow wrappers in Drill, and we will need to provide a
> library of Arrow wrappers in Flink, and so on. That's redundant work, and
> it may reduce the popularity of Arrow.

How are you going to come up with a set of APIs that protect the user or
unroll checks? Or you just arguing that the user should not be protected?

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message