arrow-dev mailing list archives

From John Muehlhausen <...@jgm.org>
Subject Re: Stored state of incremental writes to fixed size Arrow buffer?
Date Mon, 06 May 2019 16:16:09 GMT
François, Wes,

Thanks for the feedback.  I think the most practical thing for me to do is
1- write a Feather file that is structured to pre-allocate the space I need
(e.g. initial variable-length strings are of average size)
2- come up with code to monkey around with the values contained in the
vectors so that before and after each manipulation the file is valid as I
walk the rows ... this is a writer that uses memory mapping
3- check back in here once that works, assuming that it does, to see if we
can bless certain mutation paths
4- if we can't bless certain mutation paths, fork the project for those who
care more about stream processing?  That would not seem to be ideal as I
think mutation in row-order across the data set is relatively low impact on
the overall design?
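
Step 2 above can be sketched roughly as follows. This is a minimal illustration in Python, assuming a made-up file layout (a row-count header followed by one pre-allocated float64 column); it is not the Feather or Arrow IPC format, and the names `create`, `append`, and `read_committed` are hypothetical. The point is only the ordering: write the value, then publish it by bumping the count, so a reader that trusts the count sees a valid file before and after each append.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical layout, NOT the Feather/Arrow IPC format:
#   bytes 0..7 : committed row count (little-endian uint64)
#   bytes 8..  : CAPACITY pre-allocated float64 slots
HEADER = struct.Struct("<Q")
VALUE = struct.Struct("<d")
CAPACITY = 1024


def create(path):
    """Pre-allocate the whole file; it starts with zero committed rows."""
    with open(path, "wb") as f:
        f.truncate(HEADER.size + CAPACITY * VALUE.size)


def append(mm, value):
    """Write the value first, then publish it by bumping the row count."""
    (rows,) = HEADER.unpack_from(mm, 0)
    if rows >= CAPACITY:
        raise BufferError("pre-allocated space exhausted")
    VALUE.pack_into(mm, HEADER.size + rows * VALUE.size, value)
    # Before this store a reader sees `rows` valid rows, after it `rows + 1`.
    # Real concurrent readers would also need memory-ordering guarantees
    # that this plain-Python sketch does not provide.
    HEADER.pack_into(mm, 0, rows + 1)


def read_committed(mm):
    """Return only the rows the writer has published so far."""
    (rows,) = HEADER.unpack_from(mm, 0)
    return [
        VALUE.unpack_from(mm, HEADER.size + i * VALUE.size)[0]
        for i in range(rows)
    ]


def demo():
    path = os.path.join(tempfile.mkdtemp(), "column.bin")
    create(path)
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
        append(mm, 1.5)
        append(mm, 2.5)
        return read_committed(mm)
```

Variable-width strings would need the same treatment for an offsets region and a data region, with the count still published last.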

Thanks again for engaging the topic!
-John

On Mon, May 6, 2019 at 10:26 AM Francois Saint-Jacques <fsaintjacques@gmail.com> wrote:

> Hello John,
>
> Arrow is not yet suited for partial writes. The specification only
> talks about fully frozen/immutable objects; you're in
> implementation-defined territory here. For example, the C++ library
> assumes the Array object is immutable; it memoizes the null count, and
> likely more statistics in the future.
>
> If you want to use pre-allocated buffers and arrays, you can use the
> column validity bitmap for this purpose, e.g. set all rows null by
> default and flip the bit once the row is written. It suffers from
> concurrency issues (+ invalidation issues as pointed out) when dealing
> with multiple columns. You'll have to use a barrier of some kind, e.g. a
> per-batch global atomic (if append-only), or dedicated column(s) à la
> MVCC. But then, the reader needs to be aware of this and compute a mask
> each time it needs to query the partial batch.
>
> This is a common columnar database problem; see [1] for a recent paper
> on the subject. The usual technique is to store the recent data
> row-wise, and transform it to column-wise once a threshold is met, akin
> to a compaction phase. There was a somewhat related thread [2] lately
> about streaming vs batching. In the end, I think your solution will be
> very application specific.
>
> François
>
> [1] https://db.in.tum.de/downloads/publications/datablocks.pdf
> [2] https://lists.apache.org/thread.html/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6@%3Cdev.arrow.apache.org%3E
>
> On Mon, May 6, 2019 at 10:39 AM John Muehlhausen <jgm@jgm.org> wrote:
> >
> > Wes,
> >
> > I’m not afraid of writing my own C++ code to deal with all of this on the
> > writer side.  I just need a way to “append” (incrementally populate) e.g.
> > feather files so that a person using e.g. pyarrow doesn’t suffer some
> > catastrophic failure... and “on the side” I tell them which rows are junk
> > and deal with any concurrency issues that can’t be solved in the arena of
> > atomicity and ordering of ops.  For now I care about basic types but
> > including variable-width strings.
> >
> > For event-processing, I think Arrow has to have the concept of a
> > partially full record set.  Some alternatives are:
> > - have a batch size of one, thus littering the landscape with trivially
> > small Arrow buffers
> > - artificially increase latency with a batch size larger than one, but
> > not processing any data until a batch is complete
> > - continuously re-write the (entire!) “main” buffer as batches of
> > length 1 roll in
> > - instead of one main buffer, several, and at some threshold combine the
> > last N length-1 batches into a length N buffer ... still an inefficiency
> >
> > Consider the case of QAbstractTableModel as the underlying data for a
> > table or a chart.  This visualization shows all of the data for the
> > recent past as well as events rolling in.  If this model interface is
> > implemented as a view onto “many thousands” of individual event buffers
> > then we gain nothing from columnar layout.  (Suppose there are tons of
> > columns and most of them are scrolled out of the view.)  Likewise we
> > cannot re-write the entire model on each event... time complexity blows
> > up.  What we want is to have a large pre-allocated chunk and just change
> > rowCount() as data is “appended.”  Sure, we may run out of space and
> > have another and another chunk for future row ranges, but a handful of
> > chunks chained together is better than as many chunks as there were
> > events!
> >
> > And again, having a batch size >1 and delaying the data until a batch is
> > full is a non-starter.
> >
> > I am really hoping to see partially-filled buffers as something we keep
> > our finger on moving forward!  Or else, what am I missing?
> >
> > -John
> >
> > On Mon, May 6, 2019 at 8:24 AM Wes McKinney <wesmckinn@gmail.com> wrote:
> >
> > > hi John,
> > >
> > > In C++ the builder classes don't yet support writing into preallocated
> > > memory. It would be tricky for applications to determine a priori
> > > which segments of memory to pass to the builder. It seems only
> > > feasible for primitive / fixed-size types so my guess would be that a
> > > separate set of interfaces would need to be developed for this task.
> > >
> > > - Wes
> > >
> > > On Mon, May 6, 2019 at 8:18 AM Jacques Nadeau <jacques@apache.org> wrote:
> > > >
> > > > This is more of a question of implementation versus specification.
> > > > An Arrow buffer is generally built and then sealed. In different
> > > > languages, this building process works differently (a concern of the
> > > > language rather than the memory specification). We don't currently
> > > > allow a half-built vector to be moved to another language and then
> > > > be further built. So the question is really more concrete: what
> > > > language are you looking at and what is the specific pattern you're
> > > > trying to undertake for building?
> > > >
> > > > If you're trying to go across independent processes (whether the same
> > > > process restarted or two separate processes active simultaneously)
> > > > you'll need to build up your own data structures to help with this.
> > > >
> > > > On Mon, May 6, 2019 at 6:28 PM John Muehlhausen <jgm@jgm.org> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > Glad to learn of this project— good work!
> > > > >
> > > > > If I allocate a single chunk of memory and start building Arrow
> > > > > format within it, does this chunk save any state regarding my
> > > > > progress?
> > > > >
> > > > > For example, suppose I allocate a column for floating point (fixed
> > > > > width) and a column for string (variable width).  Suppose I start
> > > > > building the floating point column at offset X into my single
> > > > > buffer, and the string “pointer” column at offset Y into the same
> > > > > single buffer, and the string data elements at offset Z.
> > > > >
> > > > > I write one floating point number and one string, then go away.
> > > > > When I come back to this buffer to append another value, does the
> > > > > buffer itself know where I would begin?  I.e. is there a
> > > > > differentiation in the column (or blob) data itself between the
> > > > > available space and the used space?
> > > > >
> > > > > Suppose I write a lot of large variable width strings and “run out”
> > > > > of space for them before running out of space for floating point
> > > > > numbers or string pointers.  (I guessed badly when doing the
> > > > > original allocation.)  I consider this to be Ok since I can always
> > > > > “copy” the data to “compress out” the unused fp/pointer buckets...
> > > > > the choice is up to me.
> > > > >
> > > > > The above applied to a (feather?) file is how I anticipate appending
> > > > > data to disk... pre-allocate a mem-mapped file and gradually fill it
> > > > > up.  The efficiency of file utilization will depend on my
> > > > > projections regarding variable-width data types, but as I said
> > > > > above, I can always re-write the file if/when this bothers me.
> > > > >
> > > > > Is this the recommended and supported approach for incremental
> > > > > appends?  I’m really hoping to use Arrow instead of rolling my own,
> > > > > but functionality like this is absolutely key!  Hoping not to use a
> > > > > side-car file (or memory chunk) to store “append progress”
> > > > > information.
> > > > >
> > > > > I am brand new to this project so please forgive me if I have
> > > > > overlooked something obvious.  And again, looks like great work so
> > > > > far!
> > > > >
> > > > > Thanks!
> > > > > -John
> > > > >
> > >
>
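
The validity-bitmap idea from François's reply (every pre-allocated row starts null, and its bit is flipped once the row is written) can be sketched as below. This is a rough illustration in plain Python, not the Arrow C++ or pyarrow API; the helper names are hypothetical, and only the least-significant-bit numbering of the bitmap is taken from the Arrow format. It covers a single column and a single writer, so it does not address the multi-column barrier or null-count memoization issues raised in the thread.

```python
def make_bitmap(capacity):
    """All bits zero: every pre-allocated row starts out null."""
    return bytearray((capacity + 7) // 8)


def set_valid(bitmap, i):
    # Arrow validity bitmaps use least-significant-bit numbering.
    bitmap[i // 8] |= 1 << (i % 8)


def is_valid(bitmap, i):
    return bool(bitmap[i // 8] & (1 << (i % 8)))


def write_row(values, bitmap, i, value):
    """Store the value first, then flip the bit that makes it visible."""
    values[i] = value
    set_valid(bitmap, i)


def visible_rows(values, bitmap):
    """What a bitmap-aware reader would see: only rows marked valid."""
    return [v for i, v in enumerate(values) if is_valid(bitmap, i)]


def demo():
    capacity = 10
    values = [0.0] * capacity        # pre-allocated slots, contents are junk
    bitmap = make_bitmap(capacity)   # 2 bytes, all rows null
    write_row(values, bitmap, 0, 3.0)
    write_row(values, bitmap, 1, 4.0)
    return visible_rows(values, bitmap), bytes(bitmap)
```

A reader that ignores the bitmap would see all ten slots, which is why the reader has to be aware of the convention.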
