arrow-dev mailing list archives

From John Muehlhausen <>
Subject Stored state of incremental writes to fixed size Arrow buffer?
Date Mon, 06 May 2019 12:58:23 GMT

Glad to learn of this project. Good work!

If I allocate a single chunk of memory and start building Arrow format
within it, does this chunk save any state regarding my progress?

For example, suppose I allocate a column for floating point (fixed width)
and a column for string (variable width).  Suppose I start building the
floating point column at offset X into my single buffer, and the string
“pointer” column at offset Y into the same single buffer, and the string
data elements at offset Z.
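
To make the layout concrete, here is a rough sketch in plain Python (standard library only, not the Arrow library). The capacities, the offsets X/Y/Z, and the helper write_row are made up for illustration. The int32 offsets column with one more entry than there are rows is my understanding of how Arrow lays out variable-width data, and the sketch ignores any padding or alignment rules beyond keeping each region 8-byte aligned.

```python
import struct

# Hypothetical capacities and offsets, chosen only for illustration.
MAX_ROWS = 7
STR_BYTES = 32
X = 0                        # float64 values: 8 bytes per row
Y = X + 8 * MAX_ROWS         # int32 string offsets: MAX_ROWS + 1 entries
Z = Y + 4 * (MAX_ROWS + 1)   # string bytes
buf = bytearray(Z + STR_BYTES)

def write_row(buf, row, value, text):
    """Write one (float, string) row at index `row`; the caller tracks `row`."""
    data = text.encode("utf-8")
    # This string starts where the previous one ended (offsets[row]).
    (start,) = struct.unpack_from("<i", buf, Y + 4 * row)
    end = start + len(data)
    if end > STR_BYTES:
        raise ValueError("string region is full")
    struct.pack_into("<d", buf, X + 8 * row, value)
    buf[Z + start:Z + end] = data
    struct.pack_into("<i", buf, Y + 4 * (row + 1), end)

write_row(buf, 0, 1.5, "hello")
```

Note that nothing in these bytes records that exactly one row has been written; the caller has to remember `row`.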

I write one floating point number and one string, then go away.  When I
come back to this buffer to append another value, does the buffer itself
know where I would begin?  I.e. is there a differentiation in the column
(or blob) data itself between the available space and the used space?
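
To illustrate the kind of state I mean, a hand-rolled version might reserve a few bytes inside the chunk for a row count. The 8-byte header, CAPACITY, and append_float below are my own invention for the sake of the example; I am not claiming Arrow stores anything like this.

```python
import struct

# Hypothetical layout: an 8-byte header holding the row count, followed by
# a float64 column with room for CAPACITY values.
HEADER = 8
CAPACITY = 4
buf = bytearray(HEADER + 8 * CAPACITY)

def append_float(buf, value):
    """Append one float64 and record the progress in the buffer's own header."""
    (n,) = struct.unpack_from("<q", buf, 0)
    if n >= CAPACITY:
        raise ValueError("column is full")
    struct.pack_into("<d", buf, HEADER + 8 * n, value)
    # Update the count last, after the value is in place.
    struct.pack_into("<q", buf, 0, n + 1)
    return n + 1

append_float(buf, 1.5)
# "Coming back later": only the bytes themselves are needed to find the next slot.
append_float(buf, 2.5)
```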

Suppose I write a lot of large variable width strings and “run out” of
space for them before running out of space for floating point numbers or
string pointers.  (I guessed badly when doing the original allocation.)  I
consider this to be OK since I can always “copy” the data to “compress out”
the unused fp/pointer buckets... the choice is up to me.
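
The “copy to compress out” step could look roughly like the sketch below, again in plain Python rather than the Arrow API. The function compact and the region sizes are hypothetical, and it assumes I already know the number of rows written (n) from somewhere.

```python
import struct

def compact(floats, offsets, data, n):
    """Return tightly packed copies of the three regions for the first n rows.

    floats:  region of float64 values
    offsets: region of int32 string offsets (n + 1 of them are in use)
    data:    region of string bytes
    """
    (used,) = struct.unpack_from("<i", offsets, 4 * n)  # end of the last string
    return bytes(floats[:8 * n]), bytes(offsets[:4 * (n + 1)]), bytes(data[:used])

# Two rows ("hi", "bye") in regions sized for four rows and 32 string bytes.
floats = struct.pack("<4d", 1.5, 2.5, 0.0, 0.0)
offsets = struct.pack("<5i", 0, 2, 5, 0, 0)
data = b"hibye".ljust(32, b"\0")

f, o, d = compact(floats, offsets, data, 2)
```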

The above applied to a (Feather?) file is how I anticipate appending data
to disk... pre-allocate a mem-mapped file and gradually fill it up.  The
efficiency of file utilization will depend on my projections regarding
variable-width data types, but as I said above, I can always re-write the
file if/when this bothers me.
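
As a minimal sketch of the pre-allocate-and-fill idea, using only Python's standard mmap and struct modules: the file name, SIZE, and the single float64 slot are placeholders, and the result is raw bytes, not a valid Feather or Arrow file.

```python
import mmap
import os
import struct
import tempfile

SIZE = 4096  # hypothetical pre-allocated file size in bytes

path = os.path.join(tempfile.mkdtemp(), "column.bin")  # hypothetical file name
with open(path, "wb") as f:
    f.truncate(SIZE)  # reserve the space up front

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), SIZE) as m:
        struct.pack_into("<d", m, 0, 1.5)  # fill slot 0 of a float64 column
        m.flush()                          # push the change out to the file

# Re-open the file normally to confirm the value reached disk.
with open(path, "rb") as f:
    (first,) = struct.unpack("<d", f.read(8))
```

The file size never changes after the initial truncate; only the contents of the mapped region do.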

Is this the recommended and supported approach for incremental appends?
I’m really hoping to use Arrow instead of rolling my own, but functionality
like this is absolutely key!  Hoping not to use a side-car file (or memory
chunk) to store “append progress” information.

I am brand new to this project so please forgive me if I have overlooked
something obvious.  And again, looks like great work so far!

