arrow-dev mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: Stored state of incremental writes to fixed size Arrow buffer?
Date Mon, 06 May 2019 16:39:03 GMT
hi John -- again, I would caution you against using Feather files,
because of longevity concerns -- the internal memory layout of those
files is a "dead man walking", so to speak.

I would advise against forking the project; IMHO that is a dark path
that leads nowhere good. We have a large community here and we accept
pull requests -- I think the challenge is going to be defining the use
case with enough clarity that a general purpose solution can be
developed.

- Wes


On Mon, May 6, 2019 at 11:16 AM John Muehlhausen <jgm@jgm.org> wrote:
>
> François, Wes,
>
> Thanks for the feedback.  I think the most practical thing for me to do is
> 1- write a Feather file that is structured to pre-allocate the space I need
> (e.g. initial variable-length strings are of average size)
> 2- come up with code to monkey around with the values contained in the
> vectors so that before and after each manipulation the file is valid as I
> walk the rows ... this is a writer that uses memory mapping
> 3- check back in here once that works, assuming that it does, to see if we
> can bless certain mutation paths
> 4- if we can't bless certain mutation paths, fork the project for those who
> care more about stream processing?  That would not seem to be ideal as I
> think mutation in row-order across the data set is relatively low impact on
> the overall design?
>
> Thanks again for engaging the topic!
> -John
>
> On Mon, May 6, 2019 at 10:26 AM Francois Saint-Jacques <fsaintjacques@gmail.com> wrote:
>
> > Hello John,
> >
> > Arrow is not yet suited for partial writes. The specification only
> > talks about fully frozen/immutable objects, you're in implementation
> > defined territory here. For example, the C++ library assumes the Array
> > object is immutable; it memoizes the null count, and likely more
> > statistics in the future.
> >
> > If you want to use pre-allocated buffers and array, you can use the
> > column validity bitmap for this purpose, e.g. set all null by default
> > and flip once the row is written. It suffers from concurrency issues
> > (+ invalidation issues as pointed out) when dealing with multiple columns.
> > You'll have to use a barrier of some kind, e.g. a per-batch global
> > atomic (if append-only), or dedicated column(s) à-la MVCC. But then,
> > the reader needs to be aware of this and compute a mask each time it
> > needs to query the partial batch.
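
The validity-bitmap approach described above can be sketched in plain Python. This is a minimal illustration, not Arrow's API: the class `PreallocatedInt64Column` and its methods are hypothetical names, and only the bitmap convention (one bit per row, least-significant bit first, 1 = valid) follows the Arrow columnar format. The sketch is single-threaded; the barrier needed for multiple columns or concurrent readers is not modeled.

```python
import struct


class PreallocatedInt64Column:
    """Hypothetical fixed-capacity int64 column with a validity bitmap."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.values = bytearray(8 * capacity)           # zero-filled value slots
        self.validity = bytearray((capacity + 7) // 8)  # all bits 0: every row null

    def write(self, row, value):
        # Store the value first, then flip the validity bit. In this
        # single-threaded sketch that ordering is trivially respected; a real
        # writer shared with other threads or processes needs a memory barrier.
        struct.pack_into("<q", self.values, 8 * row, value)
        self.validity[row // 8] |= 1 << (row % 8)

    def is_valid(self, row):
        return bool(self.validity[row // 8] & (1 << (row % 8)))

    def get(self, row):
        if not self.is_valid(row):
            return None
        return struct.unpack_from("<q", self.values, 8 * row)[0]

    def null_count(self):
        # Recomputed on every call: a memoized count would go stale as soon
        # as another row is written.
        set_bits = sum(bin(byte).count("1") for byte in self.validity)
        return self.capacity - set_bits
```

Because `null_count()` changes after every write, any reader that caches it (as the C++ library does for immutable arrays) would observe stale statistics, which is the invalidation issue mentioned above.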
> >
> > This is a common columnar database problem, see [1] for a recent paper
> > on the subject. The usual technique is to store the recent data
> > row-wise, and transform it into column-wise form when a threshold is met,
> > akin to a compaction phase. There was a somewhat related thread [2] lately
> > about streaming vs batching. In the end, I think your solution will be
> > very application specific.
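
The "row-wise tail plus compaction" technique mentioned above might look roughly like the following. This is only a sketch under stated assumptions: `HybridStore` and its methods are hypothetical names, plain Python lists stand in for sealed columnar batches, and the threshold policy is the simplest possible one (compact as soon as the tail reaches a fixed row count).

```python
class HybridStore:
    """Hypothetical store: recent rows kept row-wise, older rows column-wise."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.recent_rows = []  # row-wise tail, cheap to append to
        self.batches = []      # sealed batches, each a tuple of column lists

    def append(self, row):
        self.recent_rows.append(tuple(row))
        if len(self.recent_rows) >= self.threshold:
            self.compact()

    def compact(self):
        # Transpose the row-wise tail into one column-wise batch and seal it.
        if not self.recent_rows:
            return
        columns = tuple(list(col) for col in zip(*self.recent_rows))
        self.batches.append(columns)
        self.recent_rows = []

    def column(self, index):
        # A reader has to stitch the sealed batches and the live tail together.
        out = []
        for batch in self.batches:
            out.extend(batch[index])
        out.extend(row[index] for row in self.recent_rows)
        return out
```

Sealed batches are never modified after `compact()`, so they match the immutable model the specification describes; only the small row-wise tail is mutable.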
> >
> > François
> >
> > [1] https://db.in.tum.de/downloads/publications/datablocks.pdf
> > [2]
> > https://lists.apache.org/thread.html/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6@%3Cdev.arrow.apache.org%3E
> >
> > On Mon, May 6, 2019 at 10:39 AM John Muehlhausen <jgm@jgm.org> wrote:
> > >
> > > Wes,
> > >
> > > I’m not afraid of writing my own C++ code to deal with all of this on the
> > > writer side.  I just need a way to “append” (incrementally populate) e.g.
> > > feather files so that a person using e.g. pyarrow doesn’t suffer some
> > > catastrophic failure... and “on the side” I tell them which rows are junk
> > > and deal with any concurrency issues that can’t be solved in the arena of
> > > atomicity and ordering of ops.  For now I care about basic types but
> > > including variable-width strings.
> > >
> > > For event-processing, I think Arrow has to have the concept of a
> > > partially full record set.  Some alternatives are:
> > > - have a batch size of one, thus littering the landscape with trivially
> > > small Arrow buffers
> > > - artificially increase latency with a batch size larger than one, but
> > > not processing any data until a batch is complete
> > > - continuously re-write the (entire!) “main” buffer as batches of length
> > > 1 roll in
> > > - instead of one main buffer, several, and at some threshold combine the
> > > last N length-1 batches into a length N buffer ... still an inefficiency
> > > Consider the case of QAbstractTableModel as the underlying data for a
> > > table or a chart.  This visualization shows all of the data for the
> > > recent past as well as events rolling in.  If this model interface is
> > > implemented as a view onto “many thousands” of individual event buffers
> > > then we gain nothing from columnar layout.  (Suppose there are tons of
> > > columns and most of them are scrolled out of the view.)  Likewise we
> > > cannot re-write the entire model on each event... time complexity blows
> > > up.  What we want is to have a large pre-allocated chunk and just change
> > > rowCount() as data is “appended.”  Sure, we may run out of space and have
> > > another and another chunk for future row ranges, but a handful of chunks
> > > chained together is better than as many chunks as there were events!
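
The "large pre-allocated chunk plus a moving row count" idea above can be illustrated with a small stdlib-only sketch. The names `Chunk` and `ChunkedColumn` are hypothetical, a single float64 column stands in for the whole table, and nothing here is an Arrow or Qt API; `row_count()` plays the role that rowCount() plays in the model interface.

```python
from array import array


class Chunk:
    """Hypothetical pre-allocated float64 chunk with its own row count."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = array("d", [0.0]) * capacity  # space reserved up front
        self.row_count = 0                        # rows actually written


class ChunkedColumn:
    """A handful of chained chunks; appends only touch the last one."""

    def __init__(self, chunk_capacity):
        self.chunk_capacity = chunk_capacity
        self.chunks = [Chunk(chunk_capacity)]

    def append(self, value):
        tail = self.chunks[-1]
        if tail.row_count == tail.capacity:
            # Out of space: chain another pre-allocated chunk.
            tail = Chunk(self.chunk_capacity)
            self.chunks.append(tail)
        tail.data[tail.row_count] = value
        tail.row_count += 1  # "appending" is just bumping the row count

    def row_count(self):
        return sum(chunk.row_count for chunk in self.chunks)

    def get(self, row):
        if row < 0 or row >= self.row_count():
            raise IndexError(row)
        # Every chunk except the last is full, so divmod locates the row.
        chunk_index, offset = divmod(row, self.chunk_capacity)
        return self.chunks[chunk_index].data[offset]
```

Each append costs constant time apart from the occasional new chunk, and the number of chunks grows with the row count divided by the chunk capacity rather than with the number of events.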
> > >
> > > And again, having a batch size >1 and delaying the data until a batch is
> > > full is a non-starter.
> > >
> > > I am really hoping to see partially-filled buffers as something we keep
> > > our finger on moving forward!  Or else, what am I missing?
> > >
> > > -John
> > >
> > > On Mon, May 6, 2019 at 8:24 AM Wes McKinney <wesmckinn@gmail.com> wrote:
> > >
> > > > hi John,
> > > >
> > > > In C++ the builder classes don't yet support writing into preallocated
> > > > memory. It would be tricky for applications to determine a priori
> > > > which segments of memory to pass to the builder. It seems only
> > > > feasible for primitive / fixed-size types so my guess would be that a
> > > > separate set of interfaces would need to be developed for this task.
> > > >
> > > > - Wes
> > > >
> > > > > On Mon, May 6, 2019 at 8:18 AM Jacques Nadeau <jacques@apache.org> wrote:
> > > > > >
> > > > > > This is more of a question of implementation versus specification. An
> > > > > > arrow buffer is generally built and then sealed. In different
> > > > > > languages, this building process works differently (a concern of the
> > > > > > language rather than the memory specification). We don't currently
> > > > > > allow a half built vector to be moved to another language and then be
> > > > > > further built. So the question is really more concrete: what language
> > > > > > are you looking at and what is the specific pattern you're trying to
> > > > > > undertake for building.
> > > > > >
> > > > > > If you're trying to go across independent processes (whether the same
> > > > > > process restarted or two separate processes active simultaneously)
> > > > > > you'll need to build up your own data structures to help with this.
> > > > >
> > > > > > On Mon, May 6, 2019 at 6:28 PM John Muehlhausen <jgm@jgm.org> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > Glad to learn of this project— good work!
> > > > > >
> > > > > > > If I allocate a single chunk of memory and start building Arrow format
> > > > > > > within it, does this chunk save any state regarding my progress?
> > > > > > >
> > > > > > > For example, suppose I allocate a column for floating point (fixed
> > > > > > > width) and a column for string (variable width).  Suppose I start
> > > > > > > building the floating point column at offset X into my single buffer,
> > > > > > > and the string “pointer” column at offset Y into the same single
> > > > > > > buffer, and the string data elements at offset Z.
> > > > > >
> > > > > > > I write one floating point number and one string, then go away.  When
> > > > > > > I come back to this buffer to append another value, does the buffer
> > > > > > > itself know where I would begin?  I.e. is there a differentiation in
> > > > > > > the column (or blob) data itself between the available space and the
> > > > > > > used space?
> > > > > >
> > > > > > > Suppose I write a lot of large variable width strings and “run out” of
> > > > > > > space for them before running out of space for floating point numbers
> > > > > > > or string pointers.  (I guessed badly when doing the original
> > > > > > > allocation.)  I consider this to be Ok since I can always “copy” the
> > > > > > > data to “compress out” the unused fp/pointer buckets... the choice is
> > > > > > > up to me.
> > > > > > >
> > > > > > > The above applied to a (feather?) file is how I anticipate appending
> > > > > > > data to disk... pre-allocate a mem-mapped file and gradually fill it
> > > > > > > up.  The efficiency of file utilization will depend on my projections
> > > > > > > regarding variable-width data types, but as I said above, I can always
> > > > > > > re-write the file if/when this bothers me.
> > > > > > >
> > > > > > > Is this the recommended and supported approach for incremental
> > > > > > > appends?  I’m really hoping to use Arrow instead of rolling my own,
> > > > > > > but functionality like this is absolutely key!  Hoping not to use a
> > > > > > > side-car file (or memory chunk) to store “append progress”
> > > > > > > information.
> > > > > > >
> > > > > > > I am brand new to this project so please forgive me if I have
> > > > > > > overlooked something obvious.  And again, looks like great work so far!
> > > > > >
> > > > > > Thanks!
> > > > > > -John
> > > > > >
> > > >
> >
