arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jacques Nadeau <jacq...@apache.org>
Subject Re: [DISCUSS][JAVA]Support Fast/Unsafe Vector APIs for Arrow
Date Thu, 02 May 2019 20:00:49 GMT
If someone wants to run without bounds checking, why don't they simply flip
the system property? Are they seeing that code not get eliminated in if
they set that? I think people are optimizing the wrong things in this
discussion. The memory address is available. Per Parth's comments, if
you're working on a specific application, write directly to the memory.
That's the whole point of the reliable memory format. If something isn't
working right with the elimination of bounds checking, we can find another
solution to that and lets make that the ticket.

My other comment on the PR4186 still stands: This doesn't have to be in the
ArrowBuf interface. Because we're factoring out memory as a very simple
concept, it should be easy to simply create a wrapper object that provides
this functionality with no impact on performance. We specifically expose
the memory addressed directly for exactly this type of use. The reality is:
if you want unsafe access, that basically means you don't want guardrails.
Direct memory access is the simplest/cleanest way to expose exactly that.

On Thu, May 2, 2019 at 8:18 AM Siddharth Teotia <siddharth@dremio.com>
wrote:

> Looks like there are 2 PRs for this work --
> https://github.com/apache/arrow/pull/4186 this PR adds new get<type>Unsafe
> type APIs to ArrowBuf that don't do checkIndex() before calling
> PlatformDependent.get(memory address). So the access will go through
> vector.get() -> buffer.get() -> PlatformDependent.get() -> UNSAFE.get which
> is what we do today but without doing any bounds checking
>
> I believe the proposal suggested here and the WIP PR --
> https://github.com/apache/arrow/pull/4212 adds new versions of vectors
> where the call to vector.get() bypasses the call to ArrowBuf and directly
> invokes PlatformDependent with absolute address at which we want to
> read/write. Correct? Looks like the call to arrowbuf is still needed to get
> the starting address of buffer before computing the absolute address
>
> I am wondering if much of the overhead is coming from conditions and
> branches inside bound checking or just the chain of method calls? If it is
> bounds checking then I think the first PR would suffice probably.
>
> On Tue, Apr 30, 2019 at 9:46 AM Parth Chandra <parthc@apache.org> wrote:
>
> > FWIW, in Drill's Value Vector code, we found that bounds checking was a
> > major performance bottleneck in operators that wrote to vectors. Scans,
> as
> > a result, we particularly affected. Another bottleneck was the zeroing of
> > vectors.
> > There were many unnecessary bounds checks. For example in a varchar
> vector,
> > there is one check while writing the data, one while writing the validity
> > bit, one more in the buffer allocator for the data buffer, one more in
> the
> > buffer allocator for the validity bit buffer, one more each in the
> > underlying ByteBuf implementation. It gets worse with repeated/array
> types.
> > Some code paths in Drill were optimized to get rid of these bounds checks
> > (eventually I suppose all of them will be updated). The approach was to
> > bypass the ValueVector API and write directly to the Drill(/Arrow)Buf.
> > Writing to the memory address directly, as is being proposed by Liya Fan,
> > was initially tried but did not have any measurable performance
> > improvements. BTW, writing to the memory address would also conflict with
> > ARROW-3191.
> > Note that the performance tests were for Drill queries, not Vectors, so
> > writing to memory directly may still have a noticeable performance
> benefit
> > for different use cases.
> > Sorry, I don't have actual numbers with me to share and I'm not sure how
> > much Arrow has diverged from the original Drill implementation, but the
> > Drill experience would suggest that this proposal certainly has merit.
> >
> > Parth
> >
> > On Mon, Apr 29, 2019 at 11:18 AM Wes McKinney <wesmckinn@gmail.com>
> wrote:
> >
> > > I'm also curious which APIs are particularly problematic for
> > > performance. In ARROW-1833 [1] and some related discussions there was
> > > the suggestion of adding methods like getUnsafe, so this would be like
> > > get(i) [2] but without checking the validity bitmap
> > >
> > > [1] : https://issues.apache.org/jira/browse/ARROW-1833
> > > [2]:
> > >
> >
> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/Float8Vector.java#L99
> > >
> > > On Mon, Apr 29, 2019 at 1:05 PM Micah Kornfield <emkornfield@gmail.com
> >
> > > wrote:
> > > >
> > > > Thanks for the design.   Personally, I'm not a huge fan of creating a
> > > > parallel classes for every vector type, this ends up being confusing
> > for
> > > > developers and adds a lot of boiler plate.  I wonder if you could
> use a
> > > > similar approach that the memory module uses for turning bounds
> > checking
> > > > on/off [1].
> > > >
> > > > Also, I think there was a comment on the JIRA, but are there any
> > > benchmarks
> > > > to show the expected improvements?  My limited understanding is that
> > for
> > > > small methods the JVM's JIT should inline them anyways [2] , so it is
> > not
> > > > clear how much this will improve performance.
> > > >
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > >
> > > > [1]
> > > >
> > >
> >
> https://github.com/apache/arrow/blob/master/java/memory/src/main/java/org/apache/arrow/memory/BoundsChecking.java
> > > > [2]
> > > >
> > >
> >
> https://stackoverflow.com/questions/24923040/do-modern-java-compilers-jvm-inline-functions-methods-which-are-called-exactly-f
> > > >
> > > > On Sun, Apr 28, 2019 at 2:50 AM Fan Liya <liya.fan03@gmail.com>
> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > We are proposing a new set of APIs in Arrow - unsafe vector APIs.
> The
> > > > > general ideas is attached below, and also accessible from our
> online
> > > > > document
> > > > > <
> > >
> >
> https://docs.google.com/document/d/13oZFVS1EnNedZd_7udx-h10G2tRTjfgHe2ngp2ZWJ70/edit?usp=sharing
> > > >.
> > > > > Please give your valuable comments by directly commenting in our
> > online
> > > > > document
> > > > > <
> > >
> >
> https://docs.google.com/document/d/13oZFVS1EnNedZd_7udx-h10G2tRTjfgHe2ngp2ZWJ70/edit?usp=sharing
> > > >,
> > > > > or relaying this email thread.
> > > > >
> > > > > Thank you so much in advance.
> > > > >
> > > > > Best,
> > > > > Liya Fan
> > > > >
> > > > > Support Fast/Unsafe Vector APIs for Arrow Background
> > > > >
> > > > > In our effort to support columnar data format in Apache Flink, we
> > chose
> > > > > Apache Arrow as the basic data structure. Arrow greatly simplifies
> > the
> > > > > support of the columnar data format. However, for many scenarios,
> we
> > > find
> > > > > the performance unacceptable. Our investigation shows the reason
is
> > > that,
> > > > > there are too many redundant checks and computations in current
> Arrow
> > > API.
> > > > >
> > > > >
> > > > >
> > > > > For example, the following figures shows that in a single call to
> > > > > Float8Vector.get(int) method (this is one of the most frequently
> used
> > > APIs
> > > > > in Flink computation),  there are 20+ method invocations.
> > > > >
> > > > >
> > > > > [image: image.png]
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > There are many other APIs with similar problems. The redundant
> checks
> > > and
> > > > > computations impact performance severely. According to our
> > evaluation,
> > > the
> > > > > performance may degrade by two or three orders of magnitude.
> > > > > Our Proposal
> > > > >
> > > > > For many scenarios, the checks can be avoided, if the application
> > > > > developers can guarantee that all checks will pass. So our proposal
> > is
> > > to
> > > > > provide some light-weight APIs. The APIs are also named *unsafe
> > APIs*,
> > > in
> > > > > the sense that that skip most of the checks (not safe) to improve
> the
> > > > > performance.
> > > > >
> > > > >
> > > > >
> > > > > In the light-weight APIs, we only provide minimum checks, or avoid
> > > checks
> > > > > at all. The application owner can still develop and debug their
> code
> > > using
> > > > > the original safe APIs. Once all bugs have been fixed, they can
> > switch
> > > to
> > > > > unsafe APIs in the final version of their products and enjoy the
> high
> > > > > performance.
> > > > > Our Design
> > > > >
> > > > > Our goal is to include unsafe vector APIs in Arrow code base, and
> > allow
> > > > > our customers switching to the new unsafe APIs, without being aware
> > of
> > > it,
> > > > > except for the high performance. To achieve this goal, we make the
> > > > > following design choices:
> > > > > Vector Class Hierarchy
> > > > >
> > > > > Each unsafe vector is the subclass of the safe vector. For example,
> > the
> > > > > unsafe Float8Vector is a subclass of
> > > org.apache.arrow.vector.Float8Vector:
> > > > >
> > > > >
> > > > >
> > > > > package org.apache.arrow.vector.unsafe;
> > > > >
> > > > >
> > > > >
> > > > > public class Float8Vector extends
> > org.apache.arrow.vector.Float8Vector
> > > > >
> > > > >
> > > > >
> > > > > So the safe vector acts as a fa├žade of the unsafe vector, and
> through
> > > > > polymorphism, the users may not be aware of which type of vector
> > > he/she is
> > > > > working with. In addition, the common logics can be reused in the
> > > unsafe
> > > > > vectors, and we only need to override get/set related methods.
> > > > > Vector Creation
> > > > >
> > > > > We use factory methods to create each type of vectors. Compared
> with
> > > > > vector constructors, the factory methods take one more parameter,
> the
> > > > > vectorType:
> > > > >
> > > > >
> > > > >
> > > > > public class VectorFactory {
> > > > >
> > > > >   public static Float8Vector createFloat8Vector(VectorType
> > vectorType,
> > > > > String name, BufferAllocator allocator);
> > > > >
> > > > > }
> > > > >
> > > > >
> > > > >
> > > > > VectorType is an enum to separate safe vectors from unsafe ones:
> > > > >
> > > > >
> > > > >
> > > > > public enum VectorType {
> > > > >
> > > > >   SAFE,
> > > > >
> > > > >   UNSAFE
> > > > >
> > > > > }
> > > > >
> > > > >
> > > > >
> > > > > With the factory methods, the old way of creating vectors by
> > > constructors
> > > > > can be gradually depreciated.
> > > > > Vector Implementation
> > > > >
> > > > > As discussed above, unsafe vectors mainly override get/set methods.
> > For
> > > > > get methods, we directly operate on the off-heap memory, without
> any
> > > check:
> > > > >
> > > > >
> > > > >
> > > > > public double get(int index) {
> > > > >
> > > > >     return
> > > > >
> > >
> >
> Double.longBitsToDouble(PlatformDependent.getLong(valueBuffer.memoryAddress()
> > > > > + (index << TYPE_LOG2_WIDTH)));
> > > > >
> > > > > }
> > > > >
> > > > >
> > > > >
> > > > > Note that the PlatformDependent API is only 2 stack layers above
> the
> > > > > underlying UNSAFE method call.
> > > > >
> > > > >
> > > > >
> > > > > For set methods, we still need to set the validity bit. However,
> this
> > > is
> > > > > through an unsafe method that directly sets the bits without
> > checking:
> > > > >
> > > > >
> > > > >
> > > > >          public void set(int index, double value) {
> > > > >
> > > > >       UnsafeBitVectorHelper.setValidityBitToOne(validityBuffer,
> > index);
> > > > >
> > > > > PlatformDependent.putLong(
> > > > >
> > > > >             valueBuffer.memoryAddress() + (index <<
> TYPE_LOG2_WIDTH),
> > > > > Double.doubleToRawLongBits(value));
> > > > >
> > > > > }
> > > > >
> > > > >
> > > > >
> > > > > Method UnsafeBitVectorHelper.setValidityBitToOne is the unsafe
> > version
> > > of
> > > > > BitVectorHelper.setValidityBitToOne that avoids checks.
> > > > >
> > > > >
> > > > > Test Cases
> > > > >
> > > > > We can reuse existing test cases by employing parameterized test
> > > classes
> > > > > to test both safe and unsafe vectors.
> > > > > Current Progress
> > > > >
> > > > > We have opened a JIRA for this work item FlINK-5200
> > > > > <https://issues.apache.org/jira/browse/ARROW-5200>, and a PR
> > > > > <https://github.com/apache/arrow/pull/4212> with initial
> > > implementations
> > > > > have been opened. We would appreciate if you could give some
> > comments.
> > > > >
> > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message