arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wes McKinney <wesmck...@gmail.com>
Subject Re: Discussion: Should we make string/binary types first class Arrow Array types?
Date Mon, 15 Aug 2016 22:38:53 GMT
These IPC details we should definitely document outside of the code.

For the String/Binary type question, I want to start a document that
explains the logical data types in Message.fbs in terms of

- what Arrow memory layout they use (for example: Int32 uses fixed bit
width 32 bits), and String uses List<UInt8> (with the restriction that
the inner buffer must not have any nulls, and the validity bitmap is
omitted)

- any type-specific custom logic around deconstructing or
reconstructing an Arrow container in an IPC/RPC setting. What we have
been debating in this e-mail thread is altering the appearance of
String/Binary representation in a record batch
(https://github.com/apache/arrow/blob/master/format/Message.fbs#L147)
from being identical to List<UInt8> (4 buffers in the flattened buffer
list -- 2 for the list node, 2 for the UInt8 node) to its own
"collapsed" form (3 buffers: bitmap, offsets, data). This means that
any code that is sending/receiving a record batch will need separate
code paths to handle List and String respectively (in the C++ code, we
are currently using the same code path for both)

For example, changes like ARROW-253
(https://github.com/apache/arrow/commit/dc01f099d966b92f4de7679b4a1caf97c363e08e)
would be documented outside of the code and message IDL.

I will open one or more JIRAs and write a patch to try to close the
loop on this.

- Wes

On Tue, Aug 16, 2016 at 6:04 AM, Julien Le Dem <julien@dremio.com> wrote:
> There's ARROW-258 which is about clarifying difference (if any) in metadata
> across RPC (sockets), IPC (shared memory) and files.
> The vector layout is the same except in RPC or files they get concatenated
> together when copied over.
> The metadata should be mostly the same (ideally the same). Buffer offsets
> are relative to the beginning of the body in the context of RPC and file
> start in files. In the context of IPC it looks like we need an extra page id
> (from Message.fbs). Is this correct?
>
> On Mon, Aug 15, 2016 at 12:01 PM, Micah Kornfield <emkornfield@gmail.com>
> wrote:
>>
>> Thanks Wes,
>> This makes sense.  +1 on the "Logical Types / IPC layout
>> document"  is there a JIRA open for this?
>>
>> I'll open a JIRA item to change the inheritance of string/binary in the
>> C++ code base.
>>
>> Thanks,
>> Micah
>>
>> On Sun, Aug 14, 2016 at 10:51 PM, Wes McKinney <wesmckinn@gmail.com>
>> wrote:
>>>
>>> On Fri, Aug 12, 2016 at 5:57 PM, Micah Kornfield <emkornfield@gmail.com>
>>> wrote:
>>> > Sorry for the late reply.
>>> >
>>> > This all sounds reasonable to me.  But I'm not sure I understand
>>> > exactly
>>> > what you mean by
>>> >
>>> >> Accordingly, in the metadata and in RPC/IPC scenarios, binary/string
>>> >> would be a single array unit in the buffer stream and flattened Field
>>> >> metadata rather than nested types (2 array units as they are
>>> >> presently).
>>> >
>>> >
>>> > The way I read it this seems to me to contradict the
>>> > cross-implementation as
>>> > "List<UInt8-not null>"?
>>> >
>>> > Thanks,
>>> > Micah
>>> >
>>>
>>> I think we can resolve this by starting a "Logical Types and IPC/RPC
>>> layout" specification document.
>>>
>>> The schema metadata
>>> (https://github.com/apache/arrow/blob/master/format/Message.fbs) is,
>>> as I understand it, strictly the domain of logical types. I believe
>>> there is some minor conflation of the notions of primitive physical
>>> types and primitive logical types.
>>>
>>> While String / Binary have identical physical layouts to List<UInt8
>>> not null>, in the domain of logical types and IPC, what we are saying
>>> is that these types are:
>>>
>>> - logically speaking: primitive, non-nested types
>>> - their IPC layout is the flattened version of the nested List<UInt8>
>>> counterpart -- a single Field node having String type (with a null
>>> count, etc.), and 3 memory buffers: validity bitmap, offsets, and
>>> data. Structurally on the wire / in shared memory (compared with
>>> List<UInt8 not null>) the only difference is the Field metadata (since
>>> if null count is 0 for the inner UInt8 values, then there is only a
>>> single buffer) -- one node versus two
>>>
>>> Let me know if this does not make sense.
>>>
>>> To move this forward I propose to begin a Logical Types / IPC layout
>>> document and begin to document the mapping between logical types and
>>> their physical in-memory representation and layout on the wire.
>>>
>>> - Wes
>>
>>
>
>
>
> --
> Julien

Mime
View raw message