arrow-dev mailing list archives

From Jacques Nadeau <jacq...@dremio.com>
Subject Re: [Discuss][Java, Non-C++ generally] Support for 64-bit int array lengths?
Date Fri, 15 Mar 2019 03:38:42 GMT
I definitely think it makes sense to introduce a second list vector and
enhance ComplexWriter/FieldReader to support longs in Java. I wouldn't
replace the existing list vector or its associated APIs.

I'm up for ArrowBuf changes to use long indexes after we get it pointing at
arbitrary memory as well.
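Jacques's point about long indexes could be sketched roughly as below. This is a hypothetical illustration, not a committed Arrow API: it mirrors the idea that the index parameter widens from int to long so a single logical buffer can address more than 2 GiB, here backed by int-indexed chunks since a single Java array cannot exceed Integer.MAX_VALUE elements.

```java
// Hypothetical sketch of long-indexed buffer access (names illustrative,
// not actual ArrowBuf methods).
public class LongIndexSketch {
    static byte getByte(byte[][] chunks, long index) {
        // Back the >2 GiB logical address space with 1 GiB chunks, because
        // a single Java array is itself limited to int indexes.
        final int CHUNK = 1 << 30;
        return chunks[(int) (index / CHUNK)][(int) (index % CHUNK)];
    }

    public static void main(String[] args) {
        // Tiny chunks stand in for 1 GiB ones; the index math is the point.
        byte[][] chunks = {new byte[4], new byte[4]};
        chunks[1][2] = 42;
        // A logical index beyond 2^30 resolves into the second chunk.
        System.out.println(getByte(chunks, (1L << 30) + 2)); // prints 42
    }
}
```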

On Thu, Mar 14, 2019, 6:53 AM Wes McKinney <wesmckinn@gmail.com> wrote:

> hi Micah,
>
> Given the constraints from Netty in Java, I would say that it makes
> sense to raise an exception when encountering a Field whose length
> exceeds 2^31 - 1 (I think there are already some checks, but we can
> add more checks during the IPC metadata read pass). With shared memory
> / zero copy in Java happening _eventually_
> (https://issues.apache.org/jira/browse/ARROW-3191) this is becoming
> more of a realistic issue, since someone may produce a massive dataset
> and then try to read it in Java.
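The check Wes describes could look something like the following. The method name and message are illustrative only, not actual Arrow Java code; the sketch just shows rejecting, at metadata-read time, any reported field length that Java's int-indexed vectors cannot address.

```java
// Hypothetical sketch: reject a Field whose reported length exceeds what
// Java's int-indexed vectors can address (2^31 - 1 elements).
public class FieldLengthCheck {
    static void checkFieldLength(long length) {
        if (length > Integer.MAX_VALUE) {
            throw new IllegalArgumentException(
                "Field length " + length + " exceeds maximum addressable "
                + "Java vector size " + Integer.MAX_VALUE);
        }
    }

    public static void main(String[] args) {
        checkFieldLength(1_000_000L); // fine, fits in an int
        boolean threw = false;
        try {
            checkFieldLength(1L << 32); // 2^32 elements: too large for Java
        } catch (IllegalArgumentException e) {
            threw = true;
        }
        System.out.println(threw ? "rejected oversized field" : "BUG");
    }
}
```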
>
> 64-bit variable-size offsets (i.e. LargeList, LargeBinary /
> LargeString) are a different matter. A list or varbinary vector could
> have 64-bit offsets; that should not cause any issues. We need these
> in C++ to unblock some real-world use cases with embedding large
> objects in Arrow data structures and reading them from shared memory
> with zero copy. If an implementation is unable to read such huge data
> structures due to structural limitations we need only document this.
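The difference Wes is describing can be illustrated with a small sketch. This is not Arrow's actual LargeBinary class, just the offsets-buffer arithmetic: with 64-bit offsets the cumulative data size may pass 2^31 - 1 bytes, while each value is still sliced out as (offsets[i], offsets[i+1]).

```java
// Illustrative sketch (not Arrow's actual classes): a LargeBinary-style
// layout whose offsets buffer holds 64-bit values.
public class LargeBinarySketch {
    // offsets has length (valueCount + 1);
    // offsets[i + 1] - offsets[i] is value i's byte length.
    static long valueLength(long[] offsets, int index) {
        return offsets[index + 1] - offsets[index];
    }

    public static void main(String[] args) {
        // Three values whose cumulative size passes the 32-bit boundary.
        long[] offsets = {0L, 3_000_000_000L, 3_000_000_005L, 6_000_000_000L};
        // Value 0 is 3 GB long: its end offset cannot fit in an int32 offset buffer.
        System.out.println(valueLength(offsets, 0)); // prints 3000000000
        System.out.println(valueLength(offsets, 1)); // prints 5
    }
}
```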
>
> - Wes
>
> On Thu, Mar 14, 2019 at 4:41 AM Ravindra Pindikura <ravindra@dremio.com>
> wrote:
> >
> > @Jacques Nadeau <jacques@dremio.com> would have more background on this.
> > Here's my understanding:
> >
> > On Thu, Mar 14, 2019 at 12:08 PM Micah Kornfield <emkornfield@gmail.com>
> > wrote:
> >
> > > I was working on a proof of concept Java implementation of LargeList [1]
> > > (64-bit array offsets).  Our Java implementation doesn't appear to
> > > support Vectors/Arrays larger than Integer.MAX_VALUE addressable space.
> > >
> > > It looks like Message.fbs was updated quite a while ago to support
> > > 64-bit lengths/offsets [2].  I had some questions:
> > >
> > > 1.  For Java:
> > >   * Is my assessment accurate that it doesn't support 64-bit ranged
> > > sizes?
> > >
> >
> > yes.
> >
> >
> > >   * Is there a desire to support the 64-bit sizes? (I didn't come
> > > across any JIRAs when I did a search)
> > >
> >
> > no, afaik.
> >
> >
> > >  *  Is there a technical blocker for doing so?
> > >
> >
> > - big change
> > - arrow uses the netty allocator. that also uses int (32-bit) for
> capacity.
> >
> > https://netty.io/4.0/xref/io/netty/buffer/ByteBufAllocator.html#84
> >
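The Netty constraint Ravindra links to can be made concrete with a sketch. The interface below only mirrors the int-typed capacity parameter of Netty's `ByteBufAllocator#buffer`; it is not Netty itself. Because the parameter is an `int`, a capacity beyond 2^31 - 1 bytes (~2 GiB) cannot even be requested, and narrowing a larger long silently overflows.

```java
// Sketch of the 32-bit capacity limitation: the allocator's capacity
// parameter is an int, so a >2 GiB buffer cannot be expressed.
public class IntCapacityLimit {
    // Mirrors the shape of Netty's ByteBufAllocator#buffer(int) for
    // illustration only; this is not the real Netty interface.
    interface Allocator {
        byte[] buffer(int initialCapacity);
    }

    public static void main(String[] args) {
        long wanted = 4L * 1024 * 1024 * 1024; // 4 GiB
        // Narrowing the request to int overflows: the low 32 bits of 2^32 are 0.
        int narrowed = (int) wanted;
        System.out.println("requested=" + wanted + " narrowed=" + narrowed);
    }
}
```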
> >
> >  * Any thoughts on approach for doing such a large change (I'm mostly
> > > concerned with breaking existing consumers/performance regressions)?
> > >    - Given that the Java code base appears relatively stable, it might
> be
> > > that forking and creating a version "2.0" is the best viable option.
> > >
> > > 2.  For other language implementations, is there support for 64-bit
> > > sizes or only 32-bit?
> > >
> > > Thanks,
> > > Micah
> > >
> > > P.S. It looks like our spec docs are out of date with regard to this
> > > issue; they still list Int::MAX_VALUE as the largest possible array.
> > > It is on my plate to update and consolidate them.
> > >
> > > [1] https://issues.apache.org/jira/browse/ARROW-4810
> > > [2]
> > > https://github.com/apache/arrow/commit/ced9d766d70e84c4d0542c6f5d9bd57faf10781d
> > >
> >
> >
> > --
> > Thanks and regards,
> > Ravindra.
>
