arrow-dev mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: [Discuss][Java, Non-C++ generally] Support for 64-bit int array lengths?
Date Thu, 14 Mar 2019 13:52:59 GMT
hi Micah,

Given the constraints from Netty in Java, I would say that it makes
sense to raise an exception on encountering a Field whose length
exceeds 2^31 - 1 (I think there are already some checks, but we can
add more checks during the IPC metadata read pass). With shared memory
/ zero copy in Java happening _eventually_
(https://issues.apache.org/jira/browse/ARROW-3191) this is becoming
more of a realistic issue, since someone may produce a massive dataset
and then try to read it in Java.
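
As a rough illustration of the kind of check described above, here is a
minimal Java sketch. The class and method names (FieldLengthCheck,
checkedLength) are hypothetical and are not part of the Arrow Java API;
the sketch only assumes that the length arrives as a 64-bit value from
the IPC metadata and must be narrowed to an int.

```java
// Hypothetical helper, not an Arrow API: narrows a 64-bit length to int.
final class FieldLengthCheck {
    private FieldLengthCheck() {}

    /**
     * Returns the length as an int, or throws if it is negative or
     * larger than Integer.MAX_VALUE (2^31 - 1).
     */
    static int checkedLength(long length) {
        if (length < 0 || length > Integer.MAX_VALUE) {
            throw new IllegalArgumentException(
                "Field length " + length
                + " is outside the supported range 0.." + Integer.MAX_VALUE);
        }
        return (int) length;
    }
}
```

Failing at metadata read time, rather than later during buffer access,
keeps the error close to its cause.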

64-bit variable-size offsets (i.e. LargeList, LargeBinary /
LargeString) are a different matter. A list or varbinary vector could
have 64-bit offsets, and that should not cause any issues. We need these
in C++ to unblock some real-world use cases with embedding large
objects in Arrow data structures and reading them from shared memory
with zero copy. If an implementation is unable to read such huge data
structures due to structural limitations, we need only document this.
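
To make the offsets point concrete, here is a small Java sketch that
assumes the usual variable-size layout of n + 1 non-decreasing offsets
for n elements, stored here as a long[]. The names (LargeOffsetsSketch,
elementLength, fitsIn32BitAddressing) are made up for illustration and
do not come from any Arrow implementation.

```java
// Hypothetical sketch, not an Arrow API: working with 64-bit offsets.
final class LargeOffsetsSketch {
    private LargeOffsetsSketch() {}

    /** Byte length of element i, given n + 1 non-decreasing offsets. */
    static long elementLength(long[] offsets, int i) {
        long len = offsets[i + 1] - offsets[i];
        if (len < 0) {
            throw new IllegalArgumentException(
                "offsets must be non-decreasing at index " + i);
        }
        return len;
    }

    /** True when the final offset still fits in a signed 32-bit index. */
    static boolean fitsIn32BitAddressing(long[] offsets) {
        return offsets.length == 0
            || offsets[offsets.length - 1] <= Integer.MAX_VALUE;
    }
}
```

A reader limited to 32-bit buffers could use a check like the second
method to decide whether to accept the data or reject it with a
documented error.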

- Wes

On Thu, Mar 14, 2019 at 4:41 AM Ravindra Pindikura <ravindra@dremio.com> wrote:
>
> @Jacques Nadeau <jacques@dremio.com> would have more background on this.
> Here's my understanding :
>
> On Thu, Mar 14, 2019 at 12:08 PM Micah Kornfield <emkornfield@gmail.com>
> wrote:
>
> > I was working on a proof-of-concept Java implementation of LargeList [1]
> > (64-bit array offsets).  Our Java implementation doesn't
> > appear to support Vectors/Arrays larger than Integer.MAX_VALUE addressable
> > space.
> >
> > It looks like Message.fbs was updated quite a while ago to support 64-bit
> > lengths/offsets [2].  I had some questions:
> >
> > 1.  For Java:
> >   * Is my assessment accurate that it doesn't support 64-bit ranged sizes?
> >
>
> yes.
>
>
> >   * Is there a desire to support the 64 bit sizes? (I didn't come across
> > any JIRAs when I did a search)
> >
>
> no, afaik.
>
>
> >  *  Is there a technical blocker for doing so?
> >
>
> - It is a big change.
> - Arrow uses the Netty allocator, which also uses int (32-bit) for capacity.
>
> https://netty.io/4.0/xref/io/netty/buffer/ByteBufAllocator.html#84
>
>
> >  * Any thoughts on an approach for doing such a large change (I'm mostly
> > concerned with breaking existing consumers/performance regressions)?
> >    - Given that the Java code base appears relatively stable, it might be
> > that forking and creating a version "2.0" is the best viable option.
> >
> > 2.  For other language implementations, is there support for 64-bit sizes
> > or only 32-bit?
> >
> > Thanks,
> > Micah
> >
> > P.S. It looks like our spec docs are out of date with regard to this issue;
> > they still list Int::MAX_VALUE as the largest possible array. It is on my
> > plate to update and consolidate them.
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-4810
> > [2] https://github.com/apache/arrow/commit/ced9d766d70e84c4d0542c6f5d9bd57faf10781d
> >
>
>
> --
> Thanks and regards,
> Ravindra.
