arrow-dev mailing list archives

From Julien Le Dem <jul...@dremio.com>
Subject Re: Discussion: Should we make string/binary types first class Arrow Array types?
Date Mon, 15 Aug 2016 21:04:58 GMT
There's ARROW-258, which is about clarifying the differences (if any) in
metadata across RPC (sockets), IPC (shared memory) and files.
The vector layout is the same, except that in RPC or files the buffers get
concatenated together when copied over.
The metadata should be mostly the same (ideally identical). Buffer offsets
are relative to the beginning of the body in the context of RPC, and to the
start of the file in files. In the context of IPC it looks like we need an
extra page id (from Message.fbs). Is this correct?
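
To make the offset question concrete, here is a rough Python sketch (not
taken from the Arrow code base) of a String array's three buffers being
concatenated into one body, with each buffer's offset recorded relative to
the start of that body. The helper names, the dict-based field nodes and the
8-byte padding are assumptions made for illustration only; they are not the
Message.fbs schema.

```python
import struct


def encode_strings(values):
    """Build validity bitmap, int32 offsets and data buffers for a String array."""
    bitmap = bytearray((len(values) + 7) // 8)
    offsets = [0]
    data = bytearray()
    for i, value in enumerate(values):
        if value is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # least-significant bit first
            data += value.encode("utf-8")
        offsets.append(len(data))  # a null slot repeats the previous offset
    offsets_buf = struct.pack("<%di" % len(offsets), *offsets)
    return bytes(bitmap), offsets_buf, bytes(data)


def concat_body(buffers, alignment=8):
    """Concatenate buffers; return the body and (offset, length) pairs.

    Offsets are relative to the start of the body. The alignment value is an
    assumption of this sketch.
    """
    body = bytearray()
    layout = []
    for buf in buffers:
        layout.append((len(body), len(buf)))
        body += buf
        body += b"\x00" * (-len(body) % alignment)  # pad to the alignment
    return bytes(body), layout


values = ["ab", None, "cde"]
validity, offsets, data = encode_strings(values)
body, layout = concat_body([validity, offsets, data])

# Flattened String: one field node describing all three buffers.
string_nodes = [{"type": "String", "length": 3, "null_count": 1}]

# Nested List<UInt8 not null>: two field nodes over the same three buffers,
# since the inner UInt8 values have no nulls and so no validity bitmap.
list_nodes = [
    {"type": "List", "length": 3, "null_count": 1},
    {"type": "UInt8", "length": 5, "null_count": 0},
]

print(layout)
```

In a file, the same (offset, length) pairs would instead be measured from the
start of the file, which is the distinction raised above.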

On Mon, Aug 15, 2016 at 12:01 PM, Micah Kornfield <emkornfield@gmail.com>
wrote:

> Thanks Wes,
> This makes sense.  +1 on the "Logical Types / IPC layout
> document". Is there a JIRA open for this?
>
> I'll open a JIRA item to change the inheritance of string/binary in the
> C++ code base.
>
> Thanks,
> Micah
>
> On Sun, Aug 14, 2016 at 10:51 PM, Wes McKinney <wesmckinn@gmail.com>
> wrote:
>
>> On Fri, Aug 12, 2016 at 5:57 PM, Micah Kornfield <emkornfield@gmail.com>
>> wrote:
>> > Sorry for the late reply.
>> >
>> > This all sounds reasonable to me.  But I'm not sure I understand exactly
>> > what you mean by
>> >
>> >> Accordingly, in the metadata and in RPC/IPC scenarios, binary/string
>> >> would be a single array unit in the buffer stream and flattened Field
>> >> metadata rather than nested types (2 array units as they are
>> >> presently).
>> >
>> >
>> > The way I read it this seems to me to contradict the
>> > cross-implementation as "List<UInt8-not null>"?
>> >
>> > Thanks,
>> > Micah
>> >
>>
>> I think we can resolve this by starting a "Logical Types and IPC/RPC
>> layout" specification document.
>>
>> The schema metadata
>> (https://github.com/apache/arrow/blob/master/format/Message.fbs) is,
>> as I understand it, strictly the domain of logical types. I believe
>> there is some minor conflation of the notions of primitive physical
>> types and primitive logical types.
>>
>> While String / Binary have identical physical layouts to List<UInt8
>> not null>, in the domain of logical types and IPC, what we are saying
>> is that these types are:
>>
>> - logically speaking: primitive, non-nested types
>> - their IPC layout is the flattened version of the nested List<UInt8>
>> counterpart -- a single Field node having String type (with a null
>> count, etc.), and 3 memory buffers: validity bitmap, offsets, and
>> data. Structurally on the wire / in shared memory (compared with
>> List<UInt8 not null>) the only difference is the Field metadata (since
>> if null count is 0 for the inner UInt8 values, then there is only a
>> single buffer) -- one node versus two
>>
>> Let me know if this does not make sense.
>>
>> To move this forward I propose to begin a Logical Types / IPC layout
>> document and begin to document the mapping between logical types and
>> their physical in-memory representation and layout on the wire.
>>
>> - Wes
>>
>
>


-- 
Julien
