arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Siddharth Teotia <siddha...@dremio.com>
Subject Re: [DISCUSS][JAVA]Support Fast/Unsafe Vector APIs for Arrow
Date Thu, 02 May 2019 04:17:46 GMT
Looks like there are 2 PRs for this work --
https://github.com/apache/arrow/pull/4186 this PR adds new get<type>Unsafe
type APIs to ArrowBuf that don't do checkIndex() before calling
PlatformDependent.get(memory address). So the access will go through
vector.get() -> buffer.get() -> PlatformDependent.get() -> UNSAFE.get which
is what we do today but without doing any bounds checking

I believe the proposal suggested here and the WIP PR --
https://github.com/apache/arrow/pull/4212 adds new versions of vectors
where the call to vector.get() bypasses the call to ArrowBuf and directly
invokes PlatformDependent with absolute address at which we want to
read/write. Correct? Looks like the call to arrowbuf is still needed to get
the starting address of buffer before computing the absolute address

I am wondering if much of the overhead is coming from conditions and
branches inside bound checking or just the chain of method calls? If it is
bounds checking then I think the first PR would suffice probably.

On Tue, Apr 30, 2019 at 9:46 AM Parth Chandra <parthc@apache.org> wrote:

> FWIW, in Drill's Value Vector code, we found that bounds checking was a
> major performance bottleneck in operators that wrote to vectors. Scans, as
> a result, we particularly affected. Another bottleneck was the zeroing of
> vectors.
> There were many unnecessary bounds checks. For example in a varchar vector,
> there is one check while writing the data, one while writing the validity
> bit, one more in the buffer allocator for the data buffer, one more in the
> buffer allocator for the validity bit buffer, one more each in the
> underlying ByteBuf implementation. It gets worse with repeated/array types.
> Some code paths in Drill were optimized to get rid of these bounds checks
> (eventually I suppose all of them will be updated). The approach was to
> bypass the ValueVector API and write directly to the Drill(/Arrow)Buf.
> Writing to the memory address directly, as is being proposed by Liya Fan,
> was initially tried but did not have any measurable performance
> improvements. BTW, writing to the memory address would also conflict with
> ARROW-3191.
> Note that the performance tests were for Drill queries, not Vectors, so
> writing to memory directly may still have a noticeable performance benefit
> for different use cases.
> Sorry, I don't have actual numbers with me to share and I'm not sure how
> much Arrow has diverged from the original Drill implementation, but the
> Drill experience would suggest that this proposal certainly has merit.
>
> Parth
>
> On Mon, Apr 29, 2019 at 11:18 AM Wes McKinney <wesmckinn@gmail.com> wrote:
>
> > I'm also curious which APIs are particularly problematic for
> > performance. In ARROW-1833 [1] and some related discussions there was
> > the suggestion of adding methods like getUnsafe, so this would be like
> > get(i) [2] but without checking the validity bitmap
> >
> > [1] : https://issues.apache.org/jira/browse/ARROW-1833
> > [2]:
> >
> https://github.com/apache/arrow/blob/master/java/vector/src/main/java/org/apache/arrow/vector/Float8Vector.java#L99
> >
> > On Mon, Apr 29, 2019 at 1:05 PM Micah Kornfield <emkornfield@gmail.com>
> > wrote:
> > >
> > > Thanks for the design.   Personally, I'm not a huge fan of creating a
> > > parallel classes for every vector type, this ends up being confusing
> for
> > > developers and adds a lot of boiler plate.  I wonder if you could use a
> > > similar approach that the memory module uses for turning bounds
> checking
> > > on/off [1].
> > >
> > > Also, I think there was a comment on the JIRA, but are there any
> > benchmarks
> > > to show the expected improvements?  My limited understanding is that
> for
> > > small methods the JVM's JIT should inline them anyways [2] , so it is
> not
> > > clear how much this will improve performance.
> > >
> > >
> > > Thanks,
> > > Micah
> > >
> > >
> > > [1]
> > >
> >
> https://github.com/apache/arrow/blob/master/java/memory/src/main/java/org/apache/arrow/memory/BoundsChecking.java
> > > [2]
> > >
> >
> https://stackoverflow.com/questions/24923040/do-modern-java-compilers-jvm-inline-functions-methods-which-are-called-exactly-f
> > >
> > > On Sun, Apr 28, 2019 at 2:50 AM Fan Liya <liya.fan03@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > We are proposing a new set of APIs in Arrow - unsafe vector APIs. The
> > > > general ideas is attached below, and also accessible from our online
> > > > document
> > > > <
> >
> https://docs.google.com/document/d/13oZFVS1EnNedZd_7udx-h10G2tRTjfgHe2ngp2ZWJ70/edit?usp=sharing
> > >.
> > > > Please give your valuable comments by directly commenting in our
> online
> > > > document
> > > > <
> >
> https://docs.google.com/document/d/13oZFVS1EnNedZd_7udx-h10G2tRTjfgHe2ngp2ZWJ70/edit?usp=sharing
> > >,
> > > > or relaying this email thread.
> > > >
> > > > Thank you so much in advance.
> > > >
> > > > Best,
> > > > Liya Fan
> > > >
> > > > Support Fast/Unsafe Vector APIs for Arrow Background
> > > >
> > > > In our effort to support columnar data format in Apache Flink, we
> chose
> > > > Apache Arrow as the basic data structure. Arrow greatly simplifies
> the
> > > > support of the columnar data format. However, for many scenarios, we
> > find
> > > > the performance unacceptable. Our investigation shows the reason is
> > that,
> > > > there are too many redundant checks and computations in current Arrow
> > API.
> > > >
> > > >
> > > >
> > > > For example, the following figures shows that in a single call to
> > > > Float8Vector.get(int) method (this is one of the most frequently used
> > APIs
> > > > in Flink computation),  there are 20+ method invocations.
> > > >
> > > >
> > > > [image: image.png]
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > There are many other APIs with similar problems. The redundant checks
> > and
> > > > computations impact performance severely. According to our
> evaluation,
> > the
> > > > performance may degrade by two or three orders of magnitude.
> > > > Our Proposal
> > > >
> > > > For many scenarios, the checks can be avoided, if the application
> > > > developers can guarantee that all checks will pass. So our proposal
> is
> > to
> > > > provide some light-weight APIs. The APIs are also named *unsafe
> APIs*,
> > in
> > > > the sense that that skip most of the checks (not safe) to improve the
> > > > performance.
> > > >
> > > >
> > > >
> > > > In the light-weight APIs, we only provide minimum checks, or avoid
> > checks
> > > > at all. The application owner can still develop and debug their code
> > using
> > > > the original safe APIs. Once all bugs have been fixed, they can
> switch
> > to
> > > > unsafe APIs in the final version of their products and enjoy the high
> > > > performance.
> > > > Our Design
> > > >
> > > > Our goal is to include unsafe vector APIs in Arrow code base, and
> allow
> > > > our customers switching to the new unsafe APIs, without being aware
> of
> > it,
> > > > except for the high performance. To achieve this goal, we make the
> > > > following design choices:
> > > > Vector Class Hierarchy
> > > >
> > > > Each unsafe vector is the subclass of the safe vector. For example,
> the
> > > > unsafe Float8Vector is a subclass of
> > org.apache.arrow.vector.Float8Vector:
> > > >
> > > >
> > > >
> > > > package org.apache.arrow.vector.unsafe;
> > > >
> > > >
> > > >
> > > > public class Float8Vector extends
> org.apache.arrow.vector.Float8Vector
> > > >
> > > >
> > > >
> > > > So the safe vector acts as a fa├žade of the unsafe vector, and through
> > > > polymorphism, the users may not be aware of which type of vector
> > he/she is
> > > > working with. In addition, the common logics can be reused in the
> > unsafe
> > > > vectors, and we only need to override get/set related methods.
> > > > Vector Creation
> > > >
> > > > We use factory methods to create each type of vectors. Compared with
> > > > vector constructors, the factory methods take one more parameter, the
> > > > vectorType:
> > > >
> > > >
> > > >
> > > > public class VectorFactory {
> > > >
> > > >   public static Float8Vector createFloat8Vector(VectorType
> vectorType,
> > > > String name, BufferAllocator allocator);
> > > >
> > > > }
> > > >
> > > >
> > > >
> > > > VectorType is an enum to separate safe vectors from unsafe ones:
> > > >
> > > >
> > > >
> > > > public enum VectorType {
> > > >
> > > >   SAFE,
> > > >
> > > >   UNSAFE
> > > >
> > > > }
> > > >
> > > >
> > > >
> > > > With the factory methods, the old way of creating vectors by
> > constructors
> > > > can be gradually depreciated.
> > > > Vector Implementation
> > > >
> > > > As discussed above, unsafe vectors mainly override get/set methods.
> For
> > > > get methods, we directly operate on the off-heap memory, without any
> > check:
> > > >
> > > >
> > > >
> > > > public double get(int index) {
> > > >
> > > >     return
> > > >
> >
> Double.longBitsToDouble(PlatformDependent.getLong(valueBuffer.memoryAddress()
> > > > + (index << TYPE_LOG2_WIDTH)));
> > > >
> > > > }
> > > >
> > > >
> > > >
> > > > Note that the PlatformDependent API is only 2 stack layers above the
> > > > underlying UNSAFE method call.
> > > >
> > > >
> > > >
> > > > For set methods, we still need to set the validity bit. However, this
> > is
> > > > through an unsafe method that directly sets the bits without
> checking:
> > > >
> > > >
> > > >
> > > >          public void set(int index, double value) {
> > > >
> > > >       UnsafeBitVectorHelper.setValidityBitToOne(validityBuffer,
> index);
> > > >
> > > > PlatformDependent.putLong(
> > > >
> > > >             valueBuffer.memoryAddress() + (index << TYPE_LOG2_WIDTH),
> > > > Double.doubleToRawLongBits(value));
> > > >
> > > > }
> > > >
> > > >
> > > >
> > > > Method UnsafeBitVectorHelper.setValidityBitToOne is the unsafe
> version
> > of
> > > > BitVectorHelper.setValidityBitToOne that avoids checks.
> > > >
> > > >
> > > > Test Cases
> > > >
> > > > We can reuse existing test cases by employing parameterized test
> > classes
> > > > to test both safe and unsafe vectors.
> > > > Current Progress
> > > >
> > > > We have opened a JIRA for this work item FlINK-5200
> > > > <https://issues.apache.org/jira/browse/ARROW-5200>, and a PR
> > > > <https://github.com/apache/arrow/pull/4212> with initial
> > implementations
> > > > have been opened. We would appreciate if you could give some
> comments.
> > > >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message