arrow-dev mailing list archives

From Fan Liya <liya.fa...@gmail.com>
Subject Re: [DISCUSS][JAVA]Support Fast/Unsafe Vector APIs for Arrow
Date Sun, 05 May 2019 12:54:01 GMT
Hi Jacques,

Thanks a lot for your kind reply.
Please see my comments in line.

Best,
Liya Fan

>
>
> 1. How much slower is the current Arrow API, compared to directly accessing
> off-heap memory?
>
> According to my (intuitive) experience in vectorizing Flink, the current
> API is much slower, at least one or two orders of magnitude slower.
> I am sorry I do not have the exact number. However, the conclusion can be
> expected to hold true: Parth's experience on Drill also confirms the
> conclusion.
> In fact, we are working on it. ARROW-5209 is about introducing performance
> benchmarks and once that is done, the number will be clear.
>

Are you comparing to a situation where you can crash the JVM versus one
where you cannot? Let's make sure we're comparing apples to apples.

-------

I agree with you that it does not make much sense to compare safe/unsafe
APIs directly.
There is no doubt that the safe API is slow but avoids JVM crashes, whereas
the unsafe API is fast but may cause JVM crashes.

Our goal is not to compare apples with apples. Our goal is to get the best
of both worlds.
Let me illustrate how to achieve this for a SQL engine:

1. We develop our engine against the safe API. It is OK if the performance
is not good at this stage. In the meantime, we will find many bugs in the
code.
2. Once the bugs have been fixed, we switch to the unsafe API through a
single flag, and deliver our product. We are confident that few or no JVM
crashes will happen.
3. If we do encounter a JVM crash (this should happen rarely), we
switch back to the safe API, find and fix the bug, and then switch back to
the unsafe API.
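
The three steps above could look roughly like the Java sketch below. Everything
in it is hypothetical: the class FlagSwitchedVector, the property name
example.unsafe_access, and the direct ByteBuffer standing in for off-heap
memory are assumptions made for illustration, not Arrow APIs. Because
ByteBuffer performs its own checks, this sketch cannot crash the JVM; with the
checks off, an out-of-range read just returns whatever bytes lie past the
logical end of the vector.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: one flag selects checked or unchecked reads. */
final class FlagSwitchedVector {
    /** Default taken from a made-up system property; checks stay on unless it is "true". */
    static final boolean DEFAULT_CHECKS = !Boolean.getBoolean("example.unsafe_access");

    private final ByteBuffer data;   // stand-in for raw off-heap memory
    private final int valueCount;    // logical number of values
    private final boolean checks;    // the "single flag"

    FlagSwitchedVector(int capacity, int valueCount, boolean checks) {
        this.data = ByteBuffer.allocateDirect(capacity * Double.BYTES);
        this.valueCount = valueCount;
        this.checks = checks;
    }

    void set(int index, double value) {
        data.putDouble(index * Double.BYTES, value);
    }

    double get(int index) {
        // Safe mode: reject any index outside the logical range.
        if (checks && (index < 0 || index >= valueCount)) {
            throw new IndexOutOfBoundsException(
                "index " + index + ", valueCount " + valueCount);
        }
        // Fast mode falls straight through to the read.
        return data.getDouble(index * Double.BYTES);
    }
}
```

During development one would construct the vector with checks enabled (or
leave the property unset and pass DEFAULT_CHECKS); for delivery one would flip
the property, and flip it back to chase a crash.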

-------

>
> 2. Why are the current Arrow APIs so slow?
>
> I think the main reason is too many function calls. I believe each function
> call is highly optimized and only carries out simple work. However, the
> number of calls is large.
> Our online doc gives a simple example: a single call to the
> Float8Vector.get method (which is an API fundamental enough) involves
> nearly 30 method calls. That is just too much overhead, especially for
> performance-critical scenarios, like SQL engines.
>

Are they? Who is asking that? I haven't heard that feedback at all and we
use the Arrow APIs extensively in Dremio and compete very well with other
SQL engines. The APIs were designed with the perspective that they need to
protect themselves in the context of the JVM so that a random user doesn't
hurt themselves. It sounds like maybe you don't agree with that.

It would be good for you to outline the 30 methods you see as being called
in FloatVector.get method.

In general, I think we should be more focused on the compiled code once it
has been optimized, not the methods. Have you looked at the assembly for
this method that the JIT outputs? The get method should collapse to a very
small number of instructions. If it isn't, we should address that. Have you
done that analysis? Has disabling the bounds checking addressed the issue
for you?

-------

Maybe I need to take a closer look at how the other SQL engines are using
Arrow, to see if they are also bypassing the Arrow APIs.
I agree that a random user should be protected, and that this is the top
priority.

According to my experience in Flink, the JIT cannot optimize away the checks,
and removing the checks addresses the issue.
I want to illustrate this from two points of view:

1. Theoretical point of view: the JIT makes optimizations without changing
the semantics of the code, so it can never remove the checks without changing
the code's semantics. To put it simply, if the JIT has witnessed the engine
successfully process 1,000,000 records, how can it be sure that the
1,000,001st record will be successful?

2. Practical point of view: we have evaluated our SQL engine on the TPC-H 1TB
data set. This is really a large number of records, so the JIT must have done
all it could to improve the code. According to the performance results,
however, it could not eliminate the overhead caused by the checks.
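
To make the theoretical point concrete, here is a small Java sketch. The class
CheckedReader and its methods are hypothetical names made up for this example,
not Arrow code. The indices passed to gather come from data, so nothing in the
code proves they are in range, and the exception thrown by get is observable
behavior that must be preserved. Whether a given JIT can hoist or remove such
a check in simpler loops is a separate question that this sketch does not
settle.

```java
/** Hypothetical sketch: a per-access check whose failure is observable. */
final class CheckedReader {
    private final double[] values;
    private final int valueCount;   // logical length, may be less than values.length

    CheckedReader(double[] values, int valueCount) {
        this.values = values;
        this.valueCount = valueCount;
    }

    double get(int index) {
        // This check is part of the method's semantics: callers can catch the exception.
        if (index < 0 || index >= valueCount) {
            throw new IndexOutOfBoundsException(
                "index " + index + ", valueCount " + valueCount);
        }
        return values[index];
    }

    /** Sums the values at the given positions; each position is data-dependent. */
    double gather(int[] indices) {
        double sum = 0;
        for (int i : indices) {
            sum += get(i);   // any earlier success says nothing about the next index
        }
        return sum;
    }
}
```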

-------

> 3. Can we live without Arrow, and just directly access the off-heap memory
> (e.g. by the UNSAFE instance)?
>
> I guess the answer is absolutely, yes.
> Parth is doing this (bypassing Arrow API) with Drill, and this is exactly
> what we are doing with Flink. My point is that, providing light-weight APIs
> will make it easier to use Arrow. Without such APIs, Parth may need to
> provide a library of Arrow wrappers in Drill, and we will need to provide a
> library of Arrow wrappers in Flink, and so on. That's redundant work, and
> it may reduce the popularity of Arrow.


How are you going to come up with a set of APIs that protect the user or
unroll checks? Or are you just arguing that the user should not be protected?

-------

Our users should be protected, and we should allow our users to protect
themselves, if they want to.
Previously, we gave them only option A (the safe API). Now we also give them
option B (the unsafe API).
It is up to the users to make their own choice, according to their specific
requirements.
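
One possible middle ground between the two options, sketched in Java purely as
an illustration, is to validate a whole range once and then read it without a
per-element check in the caller's code. The class BatchCheckedSum and the
method sumRange are hypothetical, and the ByteBuffer is only a stand-in for
raw memory: it still checks every access internally, which a truly unsafe
reader would not.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: one range check per batch instead of one per value. */
final class BatchCheckedSum {
    /** Sums the doubles at positions [start, end) after a single up-front check. */
    static double sumRange(ByteBuffer data, int valueCount, int start, int end) {
        // Single check covering every index the loop below will touch.
        if (start < 0 || end > valueCount || start > end) {
            throw new IndexOutOfBoundsException(
                "range [" + start + ", " + end + "), valueCount " + valueCount);
        }
        double sum = 0;
        for (int i = start; i < end; i++) {
            // No explicit check here; the range was validated above.
            sum += data.getDouble(i * Double.BYTES);
        }
        return sum;
    }
}
```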

I know the change seems abrupt. Just think about it. This is a real
requirement from real users.

