arrow-dev mailing list archives

From "Zheng, Kai" <kai.zh...@intel.com>
Subject RE: Shared memory "IPC" of Arrow row batches in C++
Date Mon, 21 Mar 2016 15:40:45 GMT
Thanks Wes. This sounds like a good start in the IPC direction.

>> It'd be great to get some benchmark code written so that we are also able to make
technical decisions on the basis of measurable performance implications.
Is there any existing benchmark setup we could borrow, here, in Parquet, or elsewhere?
Does this mean we'll compare two or more approaches, or just measure the performance of a
code path like the read-path you mentioned? For the C++ part, would benchmark code in C++ or Python be preferred?

>>For example, while the read-path of the above code does not copy any data, it would
be useful to know how fast reassembling the row batch data structure is and how this scales
with the number of columns.
I guess this means the data is stored in columnar form, and the read-path will reassemble it
into row batches without any data copy (using pointers), right?
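
If it helps, here is the rough picture I have in mind (just my own sketch to check my
understanding, not the actual classes in the PR): each column of the reassembled row batch
is only a pointer and a length into the shared mapped region.

// Rough sketch of my understanding only, not the API in the PR:
// a reassembled row batch holds pointer/length views into the
// memory-mapped buffer, so no values are copied.
#include <cstdint>
#include <utility>
#include <vector>

struct ColumnView {
  const uint8_t* data;  // points into the mapped buffer
  int64_t length;       // number of values
};

struct RowBatchView {
  int64_t num_rows;
  std::vector<ColumnView> columns;  // one view per column, no copies
};

// Hypothetical reassembly: walk the data header's (offset, length)
// pairs and wrap pointers into the mapped region.
RowBatchView Reassemble(const uint8_t* mapped_base,
                        const std::vector<std::pair<int64_t, int64_t>>& offsets_and_lengths,
                        int64_t num_rows) {
  RowBatchView batch;
  batch.num_rows = num_rows;
  for (const auto& ol : offsets_and_lengths) {
    batch.columns.push_back(ColumnView{mapped_base + ol.first, ol.second});
  }
  return batch;
}

Is that roughly the idea?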

Bear with me if I've said something stupid. Thanks!

Regards,
Kai

-----Original Message-----
From: Wes McKinney [mailto:wes@cloudera.com] 
Sent: Saturday, March 19, 2016 2:06 PM
To: dev@arrow.apache.org
Subject: Shared memory "IPC" of Arrow row batches in C++

I’ve been collaborating with Steven Phillips (who’s been working on the Java Arrow impl
recently) to show a proof of concept ping-ponging Arrow data back and forth between the Java
and C++ implementations. We aren't 100% there yet, but I got a C++ to C++ round trip to a
memory map working today (for primitive types, e.g. integers):

https://github.com/apache/arrow/pull/28
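
The mechanics underneath are nothing exotic: both sides map the same file and interpret
regions of it in place. Here is a stripped-down, generic POSIX illustration of just that part
(deliberately not the Arrow classes from the PR):

// Stripped-down illustration of the shared-memory mechanics only,
// not the Arrow classes in the PR: write primitive values into a
// memory-mapped file, then map it again and read them back in place.
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
  const int64_t n = 1024;
  const size_t nbytes = n * sizeof(int32_t);

  int fd = open("/tmp/arrow_ipc_demo", O_RDWR | O_CREAT | O_TRUNC, 0600);
  if (fd < 0 || ftruncate(fd, nbytes) != 0) return 1;

  // "Writer" side: map the file and fill it with values.
  auto* wr = static_cast<int32_t*>(
      mmap(nullptr, nbytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
  for (int64_t i = 0; i < n; ++i) wr[i] = static_cast<int32_t>(i);
  munmap(wr, nbytes);

  // "Reader" side: map it again; the values are usable in place, no copy.
  auto* rd = static_cast<const int32_t*>(
      mmap(nullptr, nbytes, PROT_READ, MAP_SHARED, fd, 0));
  std::printf("first=%d last=%d\n", rd[0], rd[n - 1]);
  munmap(const_cast<int32_t*>(rd), nbytes);
  close(fd);
  return 0;
}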

We created a small metadata specification using Flatbuffers IDL — feedback would be much
desired here:

https://github.com/apache/arrow/pull/28/files#diff-520b20e87eb508faa3cc7aa9855030d7

This includes:

- Logical schemas
- Data headers: compact descriptions of the row batches associated with a particular schema

The idea is that two systems agree up front on “what is the schema” so that only the data
header (containing memory offsets and sizes and some other important data-dependent metadata)
needs to be exchanged for each row batch.
After working through this in some real code, I’m feeling fairly good that it meets the
needs of Arrow for the time being, but there may be some unknown requirements that it would
be good to learn about sooner rather than later.
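
To make the split concrete, here is a hypothetical C++ rendering of the kind of information
the per-batch data header carries; the authoritative definition is the Flatbuffers IDL linked
above:

// Hypothetical C++ rendering of what a data header conveys; the real
// definition is the Flatbuffers IDL in the PR.
#include <cstdint>
#include <vector>

// Location of one physical buffer inside the shared memory region.
struct BufferDescriptor {
  int64_t offset;  // byte offset from the start of the mapped region
  int64_t length;  // buffer size in bytes
};

// Per-field metadata that depends on the data rather than the schema.
struct FieldNode {
  int64_t length;      // number of values in this field
  int64_t null_count;  // how many of them are null
};

// Data header for one row batch; the schema itself is agreed on up
// front and is not repeated here.
struct DataHeader {
  int64_t num_rows;
  std::vector<FieldNode> fields;          // one entry per field
  std::vector<BufferDescriptor> buffers;  // validity/offset/value buffers
};
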
After some design review and iteration we’ll want to document the metadata specification
as part of the format in more gory detail.

(Note: We are using Flatbuffers for convenience, performance, and development simplicity —
one feature that is especially nice is its union support, but it can be done in other serialization
tools, too)

It'd be great to get some benchmark code written so that we are also able to make technical
decisions on the basis of measurable performance implications. For example, while the read-path
of the above code does not copy any data, it would be useful to know how fast reassembling
the row batch data structure is and how this scales with the number of columns.
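
As a straw man (not tied to the actual read-path API yet), a Google Benchmark sweep over the
column count could look roughly like this, with ReassembleRowBatch standing in for whatever
the real entry point ends up being:

// Straw-man benchmark sketch using Google Benchmark; ReassembleRowBatch
// is only a stand-in for the eventual read-path entry point.
#include <benchmark/benchmark.h>
#include <cstdint>
#include <vector>

struct ColumnView { const uint8_t* data; int64_t length; };

// Stand-in for the real read-path: wrap one pointer per column.
static std::vector<ColumnView> ReassembleRowBatch(const uint8_t* base,
                                                  int num_columns,
                                                  int64_t rows_per_column) {
  std::vector<ColumnView> cols;
  cols.reserve(num_columns);
  for (int i = 0; i < num_columns; ++i) {
    cols.push_back({base + i * rows_per_column * sizeof(int32_t), rows_per_column});
  }
  return cols;
}

static void BM_Reassemble(benchmark::State& state) {
  const int num_columns = static_cast<int>(state.range(0));
  const int64_t rows = 1024;
  std::vector<uint8_t> region(num_columns * rows * sizeof(int32_t));
  while (state.KeepRunning()) {
    auto batch = ReassembleRowBatch(region.data(), num_columns, rows);
    benchmark::DoNotOptimize(batch);
  }
}
// Sweep the column count to see how reassembly scales.
BENCHMARK(BM_Reassemble)->Arg(1)->Arg(16)->Arg(256)->Arg(4096);

BENCHMARK_MAIN();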

best regards,
Wes