arrow-dev mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: Stored state of incremental writes to fixed size Arrow buffer?
Date Mon, 06 May 2019 14:49:32 GMT
hi John,

Feel free to open some JIRA issues to make a specific proposal about
what you want to see in the libraries.

I would recommend not coupling yourself to the Feather format as it
stands now, as I would like to change it as soon as > 90% of R users
can successfully install the Arrow libraries (they cannot at present,
so I've been holding off on doing more there).

- Wes

On Mon, May 6, 2019 at 9:39 AM John Muehlhausen <jgm@jgm.org> wrote:
>
> Wes,
>
> I’m not afraid of writing my own C++ code to deal with all of this on the
> writer side.  I just need a way to “append” (incrementally populate) e.g.
> feather files so that a person using e.g. pyarrow doesn’t suffer some
> catastrophic failure... and “on the side” I tell them which rows are junk
> and deal with any concurrency issues that can’t be solved in the arena of
> atomicity and ordering of ops.  For now I care about basic types but
> including variable-width strings.
>
> For event-processing, I think Arrow has to have the concept of a partially
> full record set.  Some alternatives are:
> - have a batch size of one, thus littering the landscape with trivially
> small Arrow buffers
> - artificially increase latency with a batch size larger than one, but not
> processing any data until a batch is complete
> - continuously re-write the (entire!) “main” buffer as batches of length 1
> roll in
> - instead of one main buffer, several, and at some threshold combine the
> last N length-1 batches into a length N buffer ... still an inefficiency
>
> Consider the case of QAbstractTableModel as the underlying data for a table
> or a chart.  This visualization shows all of the data for the recent past
> as well as events rolling in.  If this model interface is implemented as a
> view onto “many thousands” of individual event buffers then we gain nothing
> from columnar layout.  (Suppose there are tons of columns and most of them
> are scrolled out of the view.)  Likewise we cannot re-write the entire
> model on each event... time complexity blows up.  What we want is to have a
> large pre-allocated chunk and just change rowCount() as data is “appended.”
> Sure, we may run out of space and have another and another chunk for
> future row ranges, but a handful of chunks chained together is better than
> as many chunks as there were events!
>
> And again, having a batch size >1 and delaying the data until a batch is
> full is a non-starter.
>
> I am really hoping to see partially-filled buffers as something we keep our
> finger on moving forward!  Or else, what am I missing?
>
> -John
>
> On Mon, May 6, 2019 at 8:24 AM Wes McKinney <wesmckinn@gmail.com> wrote:
>
> > hi John,
> >
> > In C++ the builder classes don't yet support writing into preallocated
> > memory. It would be tricky for applications to determine a priori
> > which segments of memory to pass to the builder. It seems only
> > feasible for primitive / fixed-size types so my guess would be that a
> > separate set of interfaces would need to be developed for this task.
> >
> > - Wes
> >
> > On Mon, May 6, 2019 at 8:18 AM Jacques Nadeau <jacques@apache.org> wrote:
> > >
> > > This is more of a question of implementation versus specification. An
> > > Arrow buffer is generally built and then sealed. In different languages,
> > > this building process works differently (a concern of the language rather
> > > than the memory specification). We don't currently allow a half-built
> > > vector to be moved to another language and then be further built. So the
> > > question is really more concrete: what language are you looking at and
> > > what is the specific pattern you're trying to undertake for building?
> > >
> > > If you're trying to go across independent processes (whether the same
> > > process restarted or two separate processes active simultaneously) you'll
> > > need to build up your own data structures to help with this.
> > >
> > > On Mon, May 6, 2019 at 6:28 PM John Muehlhausen <jgm@jgm.org> wrote:
> > >
> > > > Hello,
> > > >
> > > > Glad to learn of this project— good work!
> > > >
> > > > If I allocate a single chunk of memory and start building Arrow format
> > > > within it, does this chunk save any state regarding my progress?
> > > >
> > > > For example, suppose I allocate a column for floating point (fixed
> > > > width) and a column for string (variable width).  Suppose I start
> > > > building the floating point column at offset X into my single buffer,
> > > > and the string “pointer” column at offset Y into the same single
> > > > buffer, and the string data elements at offset Z.
> > > >
> > > > I write one floating point number and one string, then go away.  When
> > > > I come back to this buffer to append another value, does the buffer
> > > > itself know where I would begin?  I.e. is there a differentiation in
> > > > the column (or blob) data itself between the available space and the
> > > > used space?
> > > >
> > > > Suppose I write a lot of large variable width strings and “run out” of
> > > > space for them before running out of space for floating point numbers
> > > > or string pointers.  (I guessed badly when doing the original
> > > > allocation.)  I consider this to be Ok since I can always “copy” the
> > > > data to “compress out” the unused fp/pointer buckets... the choice is
> > > > up to me.
> > > >
> > > > The above applied to a (feather?) file is how I anticipate appending
> > > > data to disk... pre-allocate a mem-mapped file and gradually fill it
> > > > up.  The efficiency of file utilization will depend on my projections
> > > > regarding variable-width data types, but as I said above, I can always
> > > > re-write the file if/when this bothers me.
> > > >
> > > > Is this the recommended and supported approach for incremental appends?
> > > > I’m really hoping to use Arrow instead of rolling my own, but
> > > > functionality like this is absolutely key!  Hoping not to use a
> > > > side-car file (or memory chunk) to store “append progress” information.
> > > >
> > > > I am brand new to this project so please forgive me if I have
> > > > overlooked something obvious.  And again, looks like great work so far!
> > > >
> > > > Thanks!
> > > > -John
> > > >
> >
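
For readers of the archive, the layout discussed in this thread (a
fixed-width column at offset X, string offsets at offset Y, string bytes at
offset Z, all in one pre-allocated chunk) can be sketched roughly as below.
This is only an illustration in plain Python, not an Arrow or Feather API:
the class name PartialChunk, its methods, and the in-band row-count header
are all hypothetical, and the header is exactly the kind of "own data
structure" the replies say a writer would need to add. The one Arrow-like
detail is the offsets column, which holds one more entry than there are
rows, so the last written offset records how many string bytes are in use.

```python
import struct


class PartialChunk:
    """Hypothetical pre-allocated chunk: one float64 column, one string column.

    Not an Arrow API. The 8-byte row-count header is an assumption made for
    this sketch; no alignment padding is added between regions.
    """

    HEADER = struct.Struct("<q")  # number of valid rows, stored in-band

    def __init__(self, max_rows, max_string_bytes):
        self.max_rows = max_rows
        self.max_string_bytes = max_string_bytes
        self.x = self.HEADER.size               # offset X: float64 values
        self.y = self.x + 8 * max_rows          # offset Y: int32 string offsets
        self.z = self.y + 4 * (max_rows + 1)    # offset Z: string bytes
        # Zero-filled, so the row count and the first string offset start at 0.
        self.buf = bytearray(self.z + max_string_bytes)

    @property
    def row_count(self):
        return self.HEADER.unpack_from(self.buf, 0)[0]

    def _string_offset(self, i):
        return struct.unpack_from("<i", self.buf, self.y + 4 * i)[0]

    def append(self, value, text):
        """Write one row in place; return False if the chunk is full."""
        n = self.row_count
        data = text.encode("utf-8")
        start = self._string_offset(n)  # string bytes used so far
        if n >= self.max_rows or start + len(data) > self.max_string_bytes:
            return False  # caller would chain another chunk
        struct.pack_into("<d", self.buf, self.x + 8 * n, value)
        self.buf[self.z + start:self.z + start + len(data)] = data
        struct.pack_into("<i", self.buf, self.y + 4 * (n + 1), start + len(data))
        # Update the row count last, so a reader that trusts only the header
        # never looks at a row whose column data is still being written.
        self.HEADER.pack_into(self.buf, 0, n + 1)
        return True

    def row(self, i):
        if not 0 <= i < self.row_count:
            raise IndexError(i)
        value = struct.unpack_from("<d", self.buf, self.x + 8 * i)[0]
        start, end = self._string_offset(i), self._string_offset(i + 1)
        return value, bytes(self.buf[self.z + start:self.z + end]).decode("utf-8")
```

A reader would treat everything past row_count as junk, and a writer that
gets False back from append() would allocate the next chunk. Real
cross-process use would also need memory-ordering guarantees, which this
sketch does not attempt.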
