arrow-user mailing list archives

From Johan Peltenburg - EWI <J.W.Peltenb...@tudelft.nl>
Subject Re: Beginner Question: HW Input into Arrow RecordBatch
Date Wed, 17 Jul 2019 12:01:00 GMT
Hi Simon,


> would there be a way to "reinterpret" this in-memory layout as an Arrow buffer/RecordBatch/whatever
and thereby avoid copy operations?


I think you have two options.


Probably the most applicable/fastest: if you always have a fixed number of values per sample,
you might want to try FixedSizeBinary.

I haven't used it myself yet but I think its value buffer should look exactly like your one-dimensional
C-array.

You can then unwrap individual samples later on in the downstream of your pipeline.
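
To illustrate the layout idea, here is a minimal sketch in plain C++. It does not use the Arrow library; FixedWidthView and wrap_samples are hypothetical names made up for this example. It assumes a fixed-width binary column addresses value i at byte i * byte_width of one contiguous values buffer, which is why the existing DAQ buffer could be borrowed without a copy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical illustration only: this is NOT the Arrow API.
// It mimics how a fixed-width binary column addresses its values buffer:
// value i starts at byte i * byte_width, with no offsets buffer involved.
struct FixedWidthView {
    const std::uint8_t* data;  // borrowed pointer into the DAQ buffer (no copy)
    std::size_t byte_width;    // bytes per sample, e.g. 16 channels x 8 bit = 16
    std::size_t length;        // number of complete samples in the buffer

    // Returns a pointer to the first byte of sample i.
    const std::uint8_t* value(std::size_t i) const {
        assert(i < length);
        return data + i * byte_width;
    }
};

// Wraps an existing buffer without copying it. Trailing bytes that do not
// form a complete sample are ignored.
inline FixedWidthView wrap_samples(const std::uint8_t* buf,
                                   std::size_t size_bytes,
                                   std::size_t byte_width) {
    assert(byte_width > 0);
    return FixedWidthView{buf, byte_width, size_bytes / byte_width};
}
```

With a 48-byte buffer and byte_width = 16, the view holds 3 samples and value(2) points at byte 32 of the original buffer. The view only borrows the pointer, so the driver's buffer has to outlive it.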


Another possibility is that you could consider your one-dimensional C-array as the "Arrow
values buffer" of an "Arrow list<int8> array", where every list in the Arrow array has
length 16.

In fact, the format spec shows an example for list<char> here which is almost the same:
https://arrow.apache.org/docs/format/Layout.html

The only drawback would be that you'd also need to create an offsets buffer if you want
to continue to read that data later on through Arrow's API. That will also be slower than
FixedSizeBinary, as you have an added level of indirection (memory latency) when you want
to access a value.
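
As a rough sketch of what that offsets buffer would contain (plain C++, not the Arrow API; make_fixed_offsets and list_element are hypothetical helper names): for n lists the list layout uses n + 1 32-bit offsets, and list i covers the values from offsets[i] up to, but not including, offsets[i + 1]. With every list having the same length, the offsets are simply multiples of that length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: builds the offsets for num_lists lists that all have
// the same length list_len. The result has num_lists + 1 entries, and list i
// spans values [offsets[i], offsets[i + 1]).
inline std::vector<std::int32_t> make_fixed_offsets(std::size_t num_lists,
                                                    std::int32_t list_len) {
    assert(list_len >= 0);
    std::vector<std::int32_t> offsets(num_lists + 1);
    for (std::size_t i = 0; i <= num_lists; ++i) {
        offsets[i] = static_cast<std::int32_t>(i) * list_len;
    }
    return offsets;
}

// Reads element j of list i. Note the extra load of offsets[i] before the
// value itself can be read: this is the added level of indirection.
inline std::int8_t list_element(const std::int8_t* values,
                                const std::vector<std::int32_t>& offsets,
                                std::size_t i, std::size_t j) {
    assert(i + 1 < offsets.size());
    assert(j < static_cast<std::size_t>(offsets[i + 1] - offsets[i]));
    return values[static_cast<std::size_t>(offsets[i]) + j];
}
```

For 3 samples of 16 values each, the offsets come out as 0, 16, 32, 48. The values buffer itself stays untouched; only the offsets have to be allocated and filled.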


Hope this helps, and good luck,


Johan

________________________________
From: Simon Dumke <simon.dumke@ipp.mpg.de>
Sent: Wednesday, July 17, 2019 11:27:58 AM
To: user@arrow.apache.org
Subject: Beginner Question: HW Input into Arrow RecordBatch


Dear all,

I'm just getting started with Apache Arrow (or rather, thinking about it). I'm also thinking about
using Arrow not only inside our processing pipeline, but in our data acquisition pipeline too.
Regarding this, I have the following question:

There are primarily two kinds of DAQ APIs in use here:

  *   One [e.g. like int getData(unsigned char *data, size_t bufferSize)] takes a pointer
to a preallocated buffer and fills it with data from DAQ hardware
  *   The other [e.g. like int getData(unsigned char **data)] "returns" a pointer to a buffer
created inside the hardware driver, filled with data from DAQ hardware

If I want to use Arrow to transport and handle the data coming out of those APIs, I would
usually need to allocate an Arrow Buffer and (with a sweep of copy operations) parse the acquired
data into it. If the hardware's output is an interlaced stream of samples (e.g. 16 8-bit values
from a 16-channel ADC, followed by the 16 values of the next sample...), that would obviously
be row-oriented and I would therefore need to parse it manually into the Arrow buffer.
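
For illustration, the manual parsing step for the interlaced case could look roughly like the following plain C++ sketch. It is not Arrow code; deinterleave is a hypothetical name, and the output is an ordinary std::vector rather than an Arrow buffer. It copies each value once, so that all samples of channel 0 come first, then all samples of channel 1, and so on.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: converts a row-oriented (interlaced) sample stream
// into a planar (channel-by-channel) layout with one copy per value.
//   input : interleaved[s * num_channels + c]
//   output: planar[c * num_samples + s]
inline std::vector<std::uint8_t> deinterleave(const std::uint8_t* interleaved,
                                              std::size_t num_samples,
                                              std::size_t num_channels) {
    assert(interleaved != nullptr || num_samples * num_channels == 0);
    std::vector<std::uint8_t> planar(num_samples * num_channels);
    for (std::size_t s = 0; s < num_samples; ++s) {
        for (std::size_t c = 0; c < num_channels; ++c) {
            planar[c * num_samples + s] = interleaved[s * num_channels + c];
        }
    }
    return planar;
}
```

For 2 samples of 3 channels, the input {0, 1, 2, 10, 11, 12} becomes {0, 10, 1, 11, 2, 12}.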

The question is now: If the data is only a one-dimensional array of samples (like from a single-channel
ADC) or the hardware offers the option to fill the buffer in a non-interlaced / planar
manner (meaning all samples from channel 0, followed by all samples of channel 1 and so on
- essentially "columnar") - would there be a way to "reinterpret" this in-memory layout as
an Arrow buffer/RecordBatch/whatever and thereby avoid copy operations? E.g. by adding a specific
"header", or, when using an API of the first type, by providing a pointer into a buffer allocated
by Arrow and already prepared for the specific content layout?

I hope my question and intention come through clearly enough. Any ideas would be greatly appreciated!

BTW - can anybody offer some links to Getting Started guides, examples etc. on how to start
using Arrow (both C++ and Java)? I still find myself having difficulties finding the right
starting point.

Many Thanks and kind regards,

Simon

--
----
Simon Dumke

Developer - CoDaC
Department Operation

Max Planck Institut for Plasmaphysics

