arrow-dev mailing list archives

From Francois Saint-Jacques <fsaintjacq...@gmail.com>
Subject Re: Stored state of incremental writes to fixed size Arrow buffer?
Date Mon, 06 May 2019 15:25:40 GMT
Hello John,

Arrow is not yet suited for partial writes. The specification only
talks about fully frozen/immutable objects, so you're in
implementation-defined territory here. For example, the C++ library
assumes the Array object is immutable; it memoizes the null count, and
will likely memoize more statistics in the future.
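
To illustrate why memoization and in-place mutation don't mix, here is a
minimal Python sketch. It is not the Arrow C++ implementation; the
CachedArray class is a hypothetical stand-in that caches its null count
the first time it is asked, using the Arrow bitmap convention
(least-significant bit first, 1 = valid).

```python
class CachedArray:
    """Hypothetical stand-in for an array that assumes immutability."""

    def __init__(self, validity, length):
        self.validity = validity    # bytearray: 1 bit per row, LSB first, 1 = valid
        self.length = length
        self._null_count = None     # computed lazily, then reused forever

    def is_valid(self, i):
        return (self.validity[i // 8] >> (i % 8)) & 1 == 1

    def null_count(self):
        # Memoized: correct only if the validity bitmap never changes.
        if self._null_count is None:
            self._null_count = sum(
                1 for i in range(self.length) if not self.is_valid(i)
            )
        return self._null_count


validity = bytearray(1)             # 8 rows, all null by default
arr = CachedArray(validity, 8)
first = arr.null_count()            # computed and cached

validity[0] |= 0b00000001           # a writer flips row 0 to valid afterwards
stale = arr.null_count()            # cache is not invalidated
actual = sum(1 for i in range(8) if not arr.is_valid(i))
```

After the late write, `stale` still reports the count from before the
mutation while `actual` reflects the bitmap, which is the invalidation
problem a partially written array would run into.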

If you want to use pre-allocated buffers and arrays, you can use the
column validity bitmap for this purpose, e.g. set all rows to null by
default and flip the bit once the row is written. This suffers from
concurrency issues (plus invalidation issues, as pointed out) when
dealing with multiple columns. You'll have to use a barrier of some
kind, e.g. a per-batch global atomic (if append-only), or dedicated
column(s) à la MVCC. But then the reader needs to be aware of this and
compute a mask each time it needs to query the partial batch.
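
A rough sketch of that idea in plain Python, under stated assumptions:
the PartialBatch class, its two columns (price as float64, qty as int32)
and the `committed` counter are all hypothetical names, and a
threading.Lock stands in for the atomic/barrier a C++ writer would use.
The writer fills the value slots and validity bits first and publishes
the row count last; the reader snapshots the count and builds a mask.

```python
import struct
import threading


def _bit(bitmap, i):
    # Arrow-style validity bit: LSB first, 1 = valid.
    return (bitmap[i // 8] >> (i % 8)) & 1 == 1


class PartialBatch:
    """Hypothetical pre-allocated, append-only batch with two columns."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.price = bytearray(8 * capacity)        # float64 slots
        self.qty = bytearray(4 * capacity)          # int32 slots
        nbytes = (capacity + 7) // 8
        self.price_valid = bytearray(nbytes)        # all null by default
        self.qty_valid = bytearray(nbytes)
        self.committed = 0                          # per-batch published row count
        self._lock = threading.Lock()               # stand-in for an atomic + barrier

    def append(self, price, qty):
        with self._lock:
            row = self.committed
            if row >= self.capacity:
                raise IndexError("batch is full")
            struct.pack_into("<d", self.price, 8 * row, price)
            struct.pack_into("<i", self.qty, 4 * row, qty)
            self.price_valid[row // 8] |= 1 << (row % 8)
            self.qty_valid[row // 8] |= 1 << (row % 8)
            self.committed = row + 1                # publish last

    def visible_mask(self):
        # The reader recomputes this on every query of the partial batch.
        with self._lock:
            n = self.committed
        return [
            i < n and _bit(self.price_valid, i) and _bit(self.qty_valid, i)
            for i in range(self.capacity)
        ]

    def read(self, row):
        price = struct.unpack_from("<d", self.price, 8 * row)[0]
        qty = struct.unpack_from("<i", self.qty, 4 * row)[0]
        return price, qty


batch = PartialBatch(4)
batch.append(1.5, 10)
batch.append(2.5, 20)
mask = batch.visible_mask()
```

Only rows whose mask entry is true should be read; the remaining slots
are pre-allocated space that a reader unaware of the convention would
simply see as nulls.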

This is a common columnar database problem; see [1] for a recent paper
on the subject. The usual technique is to store the recent data
row-wise, and transform it to column-wise once a threshold is met, akin
to a compaction phase. There was a somewhat related thread [2] lately
about streaming vs. batching. In the end, I think your solution will be
very application-specific.
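
As a toy illustration of the row-wise-then-compact technique (not the
design from [1], and not an Arrow API), the following Python sketch
keeps recent rows in a list and converts them into an immutable
column-wise chunk once a threshold is reached. HybridStore and its
methods are hypothetical names.

```python
class HybridStore:
    """Hypothetical store: row-wise tail plus frozen column-wise chunks."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.recent_rows = []   # mutable, row-wise: list of dicts
        self.chunks = []        # frozen, column-wise: list of {name: tuple}

    def append(self, row):
        self.recent_rows.append(row)
        if len(self.recent_rows) >= self.threshold:
            self._compact()

    def _compact(self):
        # Pivot the row-wise tail into one column-wise chunk, then reset the tail.
        names = self.recent_rows[0].keys()
        chunk = {name: tuple(r[name] for r in self.recent_rows) for name in names}
        self.chunks.append(chunk)
        self.recent_rows = []

    def column(self, name):
        # Readers see compacted chunks first, then the uncompacted tail.
        values = []
        for chunk in self.chunks:
            values.extend(chunk[name])
        values.extend(r[name] for r in self.recent_rows)
        return values


store = HybridStore(threshold=3)
for i in range(4):
    store.append({"x": i, "y": float(i) * 0.5})
xs = store.column("x")
```

With a threshold of 3 and four appended rows, the first three end up in
one columnar chunk and the fourth stays in the row-wise tail until the
next compaction.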

François

[1] https://db.in.tum.de/downloads/publications/datablocks.pdf
[2] https://lists.apache.org/thread.html/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6@%3Cdev.arrow.apache.org%3E

On Mon, May 6, 2019 at 10:39 AM John Muehlhausen <jgm@jgm.org> wrote:
>
> Wes,
>
> I’m not afraid of writing my own C++ code to deal with all of this on the
> writer side.  I just need a way to “append” (incrementally populate) e.g.
> feather files so that a person using e.g. pyarrow doesn’t suffer some
> catastrophic failure... and “on the side” I tell them which rows are junk
> and deal with any concurrency issues that can’t be solved in the arena of
> atomicity and ordering of ops.  For now I care about basic types but
> including variable-width strings.
>
> For event-processing, I think Arrow has to have the concept of a partially
> full record set.  Some alternatives are:
> - have a batch size of one, thus littering the landscape with trivially
> small Arrow buffers
> - artificially increase latency with a batch size larger than one, but not
> processing any data until a batch is complete
> - continuously re-write the (entire!) “main” buffer as batches of length 1
> roll in
> - instead of one main buffer, several, and at some threshold combine the
> last N length-1 batches into a length N buffer ... still an inefficiency
>
> Consider the case of QAbstractTableModel as the underlying data for a table
> or a chart.  This visualization shows all of the data for the recent past
> as well as events rolling in.  If this model interface is implemented as a
> view onto “many thousands” of individual event buffers then we gain nothing
> from columnar layout.  (Suppose there are tons of columns and most of them
> are scrolled out of the view.)  Likewise we cannot re-write the entire
> model on each event... time complexity blows up.  What we want is to have a
> large pre-allocated chunk and just change rowCount() as data is “appended.”
>  Sure, we may run out of space and have another and another chunk for
> future row ranges, but a handful of chunks chained together is better than
> as many chunks as there were events!
>
> And again, having a batch size >1 and delaying the data until a batch is
> full is a non-starter.
>
> I am really hoping to see partially-filled buffers as something we keep our
> finger on moving forward!  Or else, what am I missing?
>
> -John
>
> On Mon, May 6, 2019 at 8:24 AM Wes McKinney <wesmckinn@gmail.com> wrote:
>
> > hi John,
> >
> > In C++ the builder classes don't yet support writing into preallocated
> > memory. It would be tricky for applications to determine a priori
> > which segments of memory to pass to the builder. It seems only
> > feasible for primitive / fixed-size types so my guess would be that a
> > separate set of interfaces would need to be developed for this task.
> >
> > - Wes
> >
> > On Mon, May 6, 2019 at 8:18 AM Jacques Nadeau <jacques@apache.org> wrote:
> > >
> > > This is more of a question of implementation versus specification. An
> > > arrow buffer is generally built and then sealed. In different languages,
> > > this building process works differently (a concern of the language rather
> > > than the memory specification). We don't currently allow a half built
> > > vector to be moved to another language and then be further built. So the
> > > question is really more concrete: what language are you looking at and
> > > what is the specific pattern you're trying to undertake for building.
> > >
> > > If you're trying to go across independent processes (whether the same
> > > process restarted or two separate processes active simultaneously) you'll
> > > need to build up your own data structures to help with this.
> > >
> > > On Mon, May 6, 2019 at 6:28 PM John Muehlhausen <jgm@jgm.org> wrote:
> > >
> > > > Hello,
> > > >
> > > > Glad to learn of this project— good work!
> > > >
> > > > If I allocate a single chunk of memory and start building Arrow format
> > > > within it, does this chunk save any state regarding my progress?
> > > >
> > > > For example, suppose I allocate a column for floating point (fixed width)
> > > > and a column for string (variable width).  Suppose I start building the
> > > > floating point column at offset X into my single buffer, and the string
> > > > “pointer” column at offset Y into the same single buffer, and the string
> > > > data elements at offset Z.
> > > >
> > > > I write one floating point number and one string, then go away.  When I
> > > > come back to this buffer to append another value, does the buffer itself
> > > > know where I would begin?  I.e. is there a differentiation in the column
> > > > (or blob) data itself between the available space and the used space?
> > > >
> > > > Suppose I write a lot of large variable width strings and “run out” of
> > > > space for them before running out of space for floating point numbers or
> > > > string pointers.  (I guessed badly when doing the original allocation.)  I
> > > > consider this to be Ok since I can always “copy” the data to “compress out”
> > > > the unused fp/pointer buckets... the choice is up to me.
> > > >
> > > > The above applied to a (feather?) file is how I anticipate appending data
> > > > to disk... pre-allocate a mem-mapped file and gradually fill it up.  The
> > > > efficiency of file utilization will depend on my projections regarding
> > > > variable-width data types, but as I said above, I can always re-write the
> > > > file if/when this bothers me.
> > > >
> > > > Is this the recommended and supported approach for incremental appends?
> > > > I’m really hoping to use Arrow instead of rolling my own, but functionality
> > > > like this is absolutely key!  Hoping not to use a side-car file (or memory
> > > > chunk) to store “append progress” information.
> > > >
> > > > I am brand new to this project so please forgive me if I have overlooked
> > > > something obvious.  And again, looks like great work so far!
> > > >
> > > > Thanks!
> > > > -John
> > > >
> >
