arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wes McKinney <wesmck...@gmail.com>
Subject Re: Discussion: Should we make string/binary types first class Arrow Array types?
Date Tue, 16 Aug 2016 11:03:10 GMT
See ARROW-262

On Mon, Aug 15, 2016 at 3:38 PM, Wes McKinney <wesmckinn@gmail.com> wrote:

> These IPC details we should definitely document outside of the code.
>
> For the String/Binary type question, I want to start a document that
> explains the logical data types in Message.fbs in terms of
>
> - what Arrow memory layout they use (for example: Int32 uses fixed bit
> width 32 bits), and String uses List<UInt8> (with the restriction that
> the inner buffer must not have any nulls, and the validity bitmap is
> omitted)
>
> - any type-specific custom logic around deconstructing or
> reconstructing an Arrow container in an IPC/RPC setting. What we have
> been debating in this e-mail thread is altering the appearance of
> String/Binary representation in a record batch
> (https://github.com/apache/arrow/blob/master/format/Message.fbs#L147)
> from being identical to List<UInt8> (4 buffers in the flattened buffer
> list -- 2 for the list node, 2 for the UInt8 node) to its own
> "collapsed" form (3 buffers: bitmap, offsets, data). This means that
> any code that is sending/receiving a record batch will need separate
> code paths to handle List and String respectively (in the C++ code, we
> are currently using the same code path for both)
>
> For example, changes like ARROW-253
> (https://github.com/apache/arrow/commit/dc01f099d966b92f4de7679b4a1caf
> 97c363e08e)
> would be documented outside of the code and message IDL.
>
> I will open one or more JIRAs and write a patch to try to close the
> loop on this.
>
> - Wes
>
> On Tue, Aug 16, 2016 at 6:04 AM, Julien Le Dem <julien@dremio.com> wrote:
> > There's ARROW-258 which is about clarifying difference (if any) in
> metadata
> > across RPC (sockets), IPC (shared memory) and files.
> > The vector layout is the same except in RPC or files they get
> concatenated
> > together when copied over.
> > The metadata should be mostly the same (ideally the same). Buffer offsets
> > are relative to the beginning of the body in the context of RPC and file
> > start in files. In the context of IPC it looks like we need an extra
> page id
> > (from Message.fbs). Is this correct?
> >
> > On Mon, Aug 15, 2016 at 12:01 PM, Micah Kornfield <emkornfield@gmail.com
> >
> > wrote:
> >>
> >> Thanks Wes,
> >> This makes sense.  +1 on the "Logical Types / IPC layout
> >> document"  is there a JIRA open for this?
> >>
> >> I'll open a JIRA item to change the inheritance of string/binary in the
> >> C++ code base.
> >>
> >> Thanks,
> >> Micah
> >>
> >> On Sun, Aug 14, 2016 at 10:51 PM, Wes McKinney <wesmckinn@gmail.com>
> >> wrote:
> >>>
> >>> On Fri, Aug 12, 2016 at 5:57 PM, Micah Kornfield <
> emkornfield@gmail.com>
> >>> wrote:
> >>> > Sorry for the late reply.
> >>> >
> >>> > This all sounds reasonable to me.  But I'm not sure I understand
> >>> > exactly
> >>> > what you mean by
> >>> >
> >>> >> Accordingly, in the metadata and in RPC/IPC scenarios, binary/string
> >>> >> would be a single array unit in the buffer stream and flattened
> Field
> >>> >> metadata rather than nested types (2 array units as they are
> >>> >> presently).
> >>> >
> >>> >
> >>> > The way I read it this seems to me to contradict the
> >>> > cross-implementation as
> >>> > "List<UInt8-not null>"?
> >>> >
> >>> > Thanks,
> >>> > Micah
> >>> >
> >>>
> >>> I think we can resolve this by starting a "Logical Types and IPC/RPC
> >>> layout" specification document.
> >>>
> >>> The schema metadata
> >>> (https://github.com/apache/arrow/blob/master/format/Message.fbs) is,
> >>> as I understand it, strictly the domain of logical types. I believe
> >>> there is some minor conflation of the notions of primitive physical
> >>> types and primitive logical types.
> >>>
> >>> While String / Binary have identical physical layouts to List<UInt8
> >>> not null>, in the domain of logical types and IPC, what we are saying
> >>> is that these types are:
> >>>
> >>> - logically speaking: primitive, non-nested types
> >>> - their IPC layout is the flattened version of the nested List<UInt8>
> >>> counterpart -- a single Field node having String type (with a null
> >>> count, etc.), and 3 memory buffers: validity bitmap, offsets, and
> >>> data. Structurally on the wire / in shared memory (compared with
> >>> List<UInt8 not null>) the only difference is the Field metadata (since
> >>> if null count is 0 for the inner UInt8 values, then there is only a
> >>> single buffer) -- one node versus two
> >>>
> >>> Let me know if this does not make sense.
> >>>
> >>> To move this forward I propose to begin a Logical Types / IPC layout
> >>> document and begin to document the mapping between logical types and
> >>> their physical in-memory representation and layout on the wire.
> >>>
> >>> - Wes
> >>
> >>
> >
> >
> >
> > --
> > Julien
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message