arrow-dev mailing list archives

From John Muehlhausen <>
Subject Re: Stored state of incremental writes to fixed size Arrow buffer?
Date Mon, 06 May 2019 14:31:18 GMT

I’m not afraid of writing my own C++ code to deal with all of this on the
writer side.  I just need a way to “append” (incrementally populate) e.g.
feather files so that a person using e.g. pyarrow doesn’t suffer some
catastrophic failure... and “on the side” I tell them which rows are junk
and deal with any concurrency issues that can’t be solved in the arena of
atomicity and ordering of ops.  For now I care about basic types but
including variable-width strings.

For event-processing, I think Arrow has to have the concept of a partially
full record set.  Some alternatives are:
- have a batch size of one, thus littering the landscape with trivially
small Arrow buffers
- artificially increase latency with a batch size larger than one,
processing no data until a batch is complete
- continuously re-write the (entire!) “main” buffer as batches of length 1
roll in
- instead of one main buffer, several, and at some threshold combine the
last N length-1 batches into a length N buffer ... still an inefficiency
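
To put rough numbers on these alternatives, here is a back-of-the-envelope
sketch in Python.  It only counts how many row writes each strategy performs;
the function names and the threshold of 100 are my own assumptions for
illustration, not anything defined by Arrow.

```python
def rows_written_rewrite_all(n_events):
    # Re-writing the entire main buffer on every event: the k-th event
    # writes k rows (all earlier rows plus the new one).
    return sum(range(1, n_events + 1))


def rows_written_compaction(n_events, threshold):
    # Every row is written once as a length-1 batch, then written a second
    # time when a full group of `threshold` batches is merged.
    merged_groups = n_events // threshold
    return n_events + merged_groups * threshold


def rows_written_preallocated(n_events):
    # Appending into a pre-allocated chunk: each row is written exactly once.
    return n_events


n = 10_000
rewrite_cost = rows_written_rewrite_all(n)         # 50_005_000 row writes
compaction_cost = rows_written_compaction(n, 100)  # 20_000 row writes
prealloc_cost = rows_written_preallocated(n)       # 10_000 row writes
```

The quadratic growth of the first strategy is the time complexity problem I
am worried about; the compaction strategy is linear but still writes every
row twice.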

Consider the case of QAbstractTableModel as the underlying data for a table
or a chart.  This visualization shows all of the data for the recent past
as well as events rolling in.  If this model interface is implemented as a
view onto “many thousands” of individual event buffers then we gain nothing
from columnar layout.  (Suppose there are tons of columns and most of them
are scrolled out of the view.)  Likewise we cannot re-write the entire
model on each event... time complexity blows up.  What we want is to have a
large pre-allocated chunk and just change rowCount() as data is “appended.”
 Sure, we may run out of space and have another and another chunk for
future row ranges, but a handful of chunks chained together is better than
as many chunks as there were events!
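
A minimal sketch of what I mean, in plain Python with no Arrow involved.
FixedChunk and ChunkChain are hypothetical names, the single float64 column
is just for brevity, and this ignores concurrency entirely:

```python
import struct


class FixedChunk:
    """Hypothetical pre-allocated chunk holding one float64 column."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = bytearray(8 * capacity)  # allocated once, never resized
        self.row_count = 0                  # the only state that changes

    def append(self, value):
        if self.row_count == self.capacity:
            return False                    # full: caller starts a new chunk
        struct.pack_into("d", self.buf, 8 * self.row_count, value)
        self.row_count += 1                 # publish the row after writing it
        return True

    def view(self):
        # Zero-copy view over the filled rows only.
        return memoryview(self.buf)[: 8 * self.row_count].cast("d")


class ChunkChain:
    """A handful of chunks chained together for successive row ranges."""

    def __init__(self, chunk_capacity):
        self.chunk_capacity = chunk_capacity
        self.chunks = [FixedChunk(chunk_capacity)]

    def append(self, value):
        if not self.chunks[-1].append(value):
            self.chunks.append(FixedChunk(self.chunk_capacity))
            self.chunks[-1].append(value)

    def row_count(self):
        return sum(chunk.row_count for chunk in self.chunks)


chain = ChunkChain(chunk_capacity=4)
for i in range(10):
    chain.append(float(i))
```

Ten events end up in three chunks rather than ten buffers, and a model's
rowCount() would simply report chain.row_count().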

And again, having a batch size >1 and delaying the data until a batch is
full is a non-starter.

I am really hoping to see partially-filled buffers as something we keep our
finger on moving forward!  Or else, what am I missing?


On Mon, May 6, 2019 at 8:24 AM Wes McKinney <> wrote:

> hi John,
> In C++ the builder classes don't yet support writing into preallocated
> memory. It would be tricky for applications to determine a priori
> which segments of memory to pass to the builder. It seems only
> feasible for primitive / fixed-size types so my guess would be that a
> separate set of interfaces would need to be developed for this task.
> - Wes
> On Mon, May 6, 2019 at 8:18 AM Jacques Nadeau <> wrote:
> >
> > This is more of a question of implementation versus specification. An
> > arrow buffer is generally built and then sealed. In different languages,
> > this building process works differently (a concern of the language rather
> > than the memory specification). We don't currently allow a half built
> > vector to be moved to another language and then be further built. So the
> > question is really more concrete: what language are you looking at and
> > what is the specific pattern you're trying to undertake for building.
> >
> > If you're trying to go across independent processes (whether the same
> > process restarted or two separate processes active simultaneously) you'll
> > need to build up your own data structures to help with this.
> >
> > On Mon, May 6, 2019 at 6:28 PM John Muehlhausen <> wrote:
> >
> > > Hello,
> > >
> > > Glad to learn of this project— good work!
> > >
> > > If I allocate a single chunk of memory and start building Arrow format
> > > within it, does this chunk save any state regarding my progress?
> > >
> > > For example, suppose I allocate a column for floating point (fixed
> > > width) and a column for string (variable width).  Suppose I start
> > > building the floating point column at offset X into my single buffer,
> > > and the string “pointer” column at offset Y into the same single buffer,
> > > and the string data elements at offset Z.
> > >
> > > I write one floating point number and one string, then go away.  When I
> > > come back to this buffer to append another value, does the buffer itself
> > > know where I would begin?  I.e. is there a differentiation in the column
> > > (or blob) data itself between the available space and the used space?
> > >
> > > Suppose I write a lot of large variable width strings and “run out” of
> > > space for them before running out of space for floating point numbers or
> > > string pointers.  (I guessed badly when doing the original allocation.)
> > > I consider this to be OK since I can always “copy” the data to “compress
> > > out” the unused fp/pointer buckets... the choice is up to me.
> > >
> > > The above applied to a (feather?) file is how I anticipate appending
> > > data to disk... pre-allocate a mem-mapped file and gradually fill it up.
> > > The efficiency of file utilization will depend on my projections
> > > regarding variable-width data types, but as I said above, I can always
> > > re-write the file if/when this bothers me.
> > >
> > > Is this the recommended and supported approach for incremental appends?
> > > I’m really hoping to use Arrow instead of rolling my own, but
> > > functionality like this is absolutely key!  Hoping not to use a side-car
> > > file (or memory chunk) to store “append progress” information.
> > >
> > > I am brand new to this project so please forgive me if I have overlooked
> > > something obvious.  And again, looks like great work so far!
> > >
> > > Thanks!
> > > -John
> > >
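
For reference, the X / Y / Z layout from the original question can be
sketched in plain Python as below.  This is only an illustration under my own
assumptions: CAPACITY, DATA_BYTES, append and read_string are hypothetical
names, and the sketch skips the padding, alignment and validity bitmaps that
a real Arrow layout would involve.  As far as I understand the format, the
array length is carried in metadata rather than in the data buffers, so the
writer has to keep the row count somewhere itself.

```python
import struct

CAPACITY = 4                 # maximum number of rows (hypothetical)
DATA_BYTES = 32              # space reserved for string bytes (hypothetical)
X = 0                        # float64 values start here
Y = X + 8 * CAPACITY         # int32 string offsets, CAPACITY + 1 entries
Z = Y + 4 * (CAPACITY + 1)   # string bytes start here
buf = bytearray(Z + DATA_BYTES)


def append(buf, row_count, number, text):
    """Write one row and return the new row count.

    The caller must remember the returned count: a zero-filled slot in the
    float column is indistinguishable from a stored 0.0.
    """
    encoded = text.encode("utf-8")
    # The last written offset says how many string bytes are already used.
    (used,) = struct.unpack_from("<i", buf, Y + 4 * row_count)
    if row_count == CAPACITY or used + len(encoded) > DATA_BYTES:
        raise BufferError("chunk is full")
    struct.pack_into("<d", buf, X + 8 * row_count, number)
    buf[Z + used : Z + used + len(encoded)] = encoded
    struct.pack_into("<i", buf, Y + 4 * (row_count + 1), used + len(encoded))
    return row_count + 1


def read_string(buf, i):
    # Row i spans the byte range [offsets[i], offsets[i + 1]) of the data area.
    start, end = struct.unpack_from("<2i", buf, Y + 4 * i)
    return bytes(buf[Z + start : Z + end]).decode("utf-8")


rows = 0
rows = append(buf, rows, 1.5, "hello")
rows = append(buf, rows, 2.5, "arrow")
```

Running out of string space before running out of float slots shows up here
as the BufferError, which is the point at which one would start a new chunk
or re-write the data more compactly.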
