arrow-dev mailing list archives

From Micah Kornfield <emkornfi...@gmail.com>
Subject Re: Discussion: Should we make string/binary types first class Arrow Array types?
Date Mon, 15 Aug 2016 19:01:25 GMT
Thanks Wes,
This makes sense. +1 on the "Logical Types / IPC layout document". Is there a
JIRA open for this?

I'll open a JIRA item to change the inheritance of string/binary in the C++
code base.

Thanks,
Micah

On Sun, Aug 14, 2016 at 10:51 PM, Wes McKinney <wesmckinn@gmail.com> wrote:

> On Fri, Aug 12, 2016 at 5:57 PM, Micah Kornfield <emkornfield@gmail.com>
> wrote:
> > Sorry for the late reply.
> >
> > This all sounds reasonable to me.  But I'm not sure I understand exactly
> > what you mean by
> >
> >> Accordingly, in the metadata and in RPC/IPC scenarios, binary/string
> >> would be a single array unit in the buffer stream and flattened Field
> >> metadata rather than nested types (2 array units as they are
> >> presently).
> >
> >
> > The way I read it this seems to me to contradict the cross-implementation as
> > "List<UInt8-not null>"?
> >
> > Thanks,
> > Micah
> >
>
> I think we can resolve this by starting a "Logical Types and IPC/RPC
> layout" specification document.
>
> The schema metadata
> (https://github.com/apache/arrow/blob/master/format/Message.fbs) is,
> as I understand it, strictly the domain of logical types. I believe
> there is some minor conflation of the notions of primitive physical
> types and primitive logical types.
>
> While String / Binary have identical physical layouts to List<UInt8
> not null>, in the domain of logical types and IPC, what we are saying
> is that these types are:
>
> - logically speaking: primitive, non-nested types
> - their IPC layout is the flattened version of the nested List<UInt8>
> counterpart -- a single Field node having String type (with a null
> count, etc.), and 3 memory buffers: validity bitmap, offsets, and
> data. Structurally on the wire / in shared memory (compared with
> List<UInt8 not null>) the only difference is the Field metadata (since
> if null count is 0 for the inner UInt8 values, then there is only a
> single buffer) -- one node versus two
>
> Let me know if this does not make sense.
>
> To move this forward I propose to begin a Logical Types / IPC layout
> document and begin to document the mapping between logical types and
> their physical in-memory representation and layout on the wire.
>
> - Wes
>
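
The flattening Wes describes in the quoted message (one Field node for String versus two for List<UInt8 not null>, over the same three buffers) can be sketched roughly as follows. This is only an illustration, not Arrow code: the function names and the dict-based "field node" records are hypothetical, and it assumes 32-bit little-endian offsets and least-significant-bit ordering in the validity bitmap.

```python
import struct
from typing import Dict, List, Optional, Tuple


def encode_strings(values: List[Optional[str]]) -> Tuple[bytes, bytes, bytes, int]:
    """Build (validity bitmap, offsets, data, null count) for a string column."""
    validity = bytearray((len(values) + 7) // 8)
    offsets = [0]
    data = bytearray()
    null_count = 0
    for i, value in enumerate(values):
        if value is None:
            null_count += 1  # a null slot contributes no bytes to the data buffer
        else:
            validity[i // 8] |= 1 << (i % 8)  # assumed LSB-first bit numbering
            data += value.encode("utf-8")
        offsets.append(len(data))
    # Assumed encoding: int32 little-endian offsets, length + 1 entries.
    offsets_buf = struct.pack("<%di" % len(offsets), *offsets)
    return bytes(validity), offsets_buf, bytes(data), null_count


def as_string(values: List[Optional[str]]) -> Tuple[List[Dict], List[bytes]]:
    """Flattened view: a single (hypothetical) field node, three buffers."""
    validity, offsets, data, null_count = encode_strings(values)
    nodes = [{"type": "String", "length": len(values), "null_count": null_count}]
    return nodes, [validity, offsets, data]


def as_list_uint8(values: List[Optional[str]]) -> Tuple[List[Dict], List[bytes]]:
    """Nested view: a List node plus a child UInt8 node, same three buffers."""
    validity, offsets, data, null_count = encode_strings(values)
    nodes = [
        {"type": "List", "length": len(values), "null_count": null_count},
        # The child has null count 0, so it carries only its data buffer.
        {"type": "UInt8", "length": len(data), "null_count": 0},
    ]
    return nodes, [validity, offsets, data]


string_nodes, string_buffers = as_string(["ab", None, "cde"])
list_nodes, list_buffers = as_list_uint8(["ab", None, "cde"])
```

With these assumptions the two views share identical buffers and differ only in how many field nodes describe them, which is the "one node versus two" point in the quoted message.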
