arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From John Muehlhausen <...@jgm.org>
Subject Re: Stored state of incremental writes to fixed size Arrow buffer?
Date Tue, 07 May 2019 16:24:55 GMT
Thanks Wes:

"the current Feather format is deprecated" ... yes, but there will be a
future file format that replaces it, correct?  And my discussion of
immutable "portions" of Arrow buffers, rather than immutability of the
entire buffer, applies to IPC as well, right?  I am only championing the
idea that this future file format have the convenience that certain
preallocated rows can be ignored based on a metadata setting.

"I recommend batching your writes" ... this is extremely inefficient and
adds unacceptable latency, relative to the proposed solution.  Do you
disagree?  Either I have a batch length of 1 to minimize latency, which
eliminates columnar advantages on the read side, or else I add latency.
Neither works, and it seems that a viable alternative is within sight?

If you don't agree that there is a core issue and opportunity here, I'm not
sure how to better make my case....

-John

On Tue, May 7, 2019 at 11:02 AM Wes McKinney <wesmckinn@gmail.com> wrote:

> hi John,
>
> On Tue, May 7, 2019 at 10:53 AM John Muehlhausen <jgm@jgm.org> wrote:
> >
> > Wes et al, I completed a preliminary study of populating a Feather file
> > incrementally.  Some notes and questions:
> >
> > I wrote the following dataframe to a feather file:
> >             a    b
> > 0  0123456789  0.0
> > 1  0123456789  NaN
> > 2  0123456789  NaN
> > 3  0123456789  NaN
> > 4        None  NaN
> >
> > In re-writing the flatbuffers metadata (flatc -p doesn't
> > support --gen-mutable! yuck! C++ to the rescue...), it seems that
> > read_feather is not affected by NumRows?  It seems to be driven entirely
> by
> > the per-column Length values?
> >
> > Also, it seems as if one of the primary usages of NullCount is to
> determine
> > whether or not a bitfield is present?  In the initialization data above I
> > was careful to have a null value in each column in order to generate a
> > bitfield.
>
> Per my prior e-mails, the current Feather format is deprecated, so I'm
> only willing to engage on a discussion of the official Arrow binary
> protocol that we use for IPC (memory mapping) and RPC (Flight).
>
> >
> > I then wiped the bitfields in the file and set all of the string indices
> to
> > one past the end of the blob buffer (all strings empty):
> >       a   b
> > 0  None NaN
> > 1  None NaN
> > 2  None NaN
> > 3  None NaN
> > 4  None NaN
> >
> > I then set the first record to some data by consuming some of the string
> > blob and row 0 and 1 indices, also setting the double:
> >
> >                a    b
> > 0  Hello, world!  5.0
> > 1           None  NaN
> > 2           None  NaN
> > 3           None  NaN
> > 4           None  NaN
> >
> > As mentioned above, NumRows seems to be ignored.  I tried adjusting each
> > column Length to mask off higher rows and it worked for 4 (hide last row)
> > but not for less than 4.  I take this to be due to math used to figure
> out
> > where the buffers are relative to one another since there is only one
> > metadata offset for all of: the (optional) bitset, index column and (if
> > string) blobs.
> >
> > Populating subsequent rows would proceed in a similar way until all of
> the
> > blob storage has been consumed, which may come before the pre-allocated
> > rows have been consumed.
> >
> > So what does this mean for my desire to incrementally write these
> > (potentially memory-mapped) pre-allocated files and/or Arrow buffers in
> > memory?  Some thoughts:
> >
> > - If a single value (such as NumRows) were consulted to determine the
> table
> > row dimension then updating this single value would serve to tell a
> reader
> > which rows are relevant.  Assuming this value is updated after all other
> > mutations are complete, and assuming that mutations are only appends
> > (addition of rows), then concurrency control involves only ensuring that
> > this value is not examined while it is being written.
> >
> > - NullCount presents a concurrency problem if someone reads the file
> after
> > this field has been updated, but before NumRows has exposed the new
> record
> > (or vice versa).  The idea previously mentioned that there will "likely
> > [be] more statistics in the future" feels like it might be scope creep to
> > me?  This is a data representation, not a calculation framework?  If
> > NullCount had its genesis in the optional nature of the bitfield, I would
> > suggest that perhaps NullCount can be dropped in favor of always
> supplying
> > the bitfield, which in any event is already contemplated by the spec:
> > "Implementations may choose to always allocate one anyway as a matter of
> > convenience."  If the concern is space savings, Arrow is already an
> > extremely uncompressed format.  (Compression is something I would also
> > consider to be scope creep as regards Feather... compressed filesystems
> can
> > be employed and there are other compressed dataframe formats.)  However,
> if
> > protecting the 4 bytes required to update NowRows turns out to be no
> easier
> > than protecting all of the statistical bytes as well as part of the same
> > "critical section" (locks: yuck!!) then statistics pose no issue.  I
> have a
> > feeling that the availability of an atomic write of 4 bytes will depend
> on
> > the storage mechanism... memory vs memory map vs write() etc.
> >
> > - The elephant in the room appears to be the presumptive binary yes/no on
> > mutability of Arrow buffers.  Perhaps the thought is that certain batch
> > processes will be wrecked if anyone anywhere is mutating buffers in any
> > way.  But keep in mind I am not proposing general mutability, only
> > appending of new data.  *A huge amount of batch processing that will take
> > place with Arrow is on time-series data (whether financial or otherwise).
> > It is only natural that architects will want the minimal impedance
> mismatch
> > when it comes time to grow those time series as the events occur going
> > forward.*  So rather than say that I want "mutable" Arrow buffers, I
> would
> > pitch this as a call for "immutable populated areas" of Arrow buffers
> > combined with the concept that the populated area can grow up to whatever
> > was preallocated.  This will not affect anyone who has "memoized" a
> > dimension and wants to continue to consider the then-current data as
> > immutable... it will be immutable and will always be immutable according
> to
> > that then-current dimension.
> >
> > Thanks in advance for considering this feedback!  I absolutely require
> > efficient row-wise growth of an Arrow-like buffer to deal with time
> series
> > data in near real time.  I believe that preallocation is (by far) the
> most
> > efficient way to accomplish this.  I hope to be able to use Arrow!  If I
> > cannot use Arrow than I will be using a home-grown Arrow that is
> identical
> > except for this feature, which would be very sad!  Even if Arrow itself
> > could be used in this manner today, I would be hesitant to use it if the
> > use-case was not protected on a go-forward basis.
> >
>
> I recommend batching your writes and using the Arrow binary streaming
> protocol so you are only appending to a file rather than mutating
> previously-written bytes. This use case is well defined and supported
> in the software already.
>
>
> https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#streaming-format
>
> - Wes
>
> > Of course, I am completely open to alternative ideas and approaches!
> >
> > -John
> >
> > On Mon, May 6, 2019 at 11:39 AM Wes McKinney <wesmckinn@gmail.com>
> wrote:
> >
> > > hi John -- again, I would caution you against using Feather files for
> > > issues of longevity -- the internal memory layout of those files is a
> > > "dead man walking" so to speak.
> > >
> > > I would advise against forking the project, IMHO that is a dark path
> > > that leads nowhere good. We have a large community here and we accept
> > > pull requests -- I think the challenge is going to be defining the use
> > > case to suitable clarity that a general purpose solution can be
> > > developed.
> > >
> > > - Wes
> > >
> > >
> > > On Mon, May 6, 2019 at 11:16 AM John Muehlhausen <jgm@jgm.org> wrote:
> > > >
> > > > François, Wes,
> > > >
> > > > Thanks for the feedback.  I think the most practical thing for me to
> do
> > > is
> > > > 1- write a Feather file that is structured to pre-allocate the space
> I
> > > need
> > > > (e.g. initial variable-length strings are of average size)
> > > > 2- come up with code to monkey around with the values contained in
> the
> > > > vectors so that before and after each manipulation the file is valid
> as I
> > > > walk the rows ... this is a writer that uses memory mapping
> > > > 3- check back in here once that works, assuming that it does, to see
> if
> > > we
> > > > can bless certain mutation paths
> > > > 4- if we can't bless certain mutation paths, fork the project for
> those
> > > who
> > > > care more about stream processing?  That would not seem to be ideal
> as I
> > > > think mutation in row-order across the data set is relatively low
> impact
> > > on
> > > > the overall design?
> > > >
> > > > Thanks again for engaging the topic!
> > > > -John
> > > >
> > > > On Mon, May 6, 2019 at 10:26 AM Francois Saint-Jacques <
> > > > fsaintjacques@gmail.com> wrote:
> > > >
> > > > > Hello John,
> > > > >
> > > > > Arrow is not yet suited for partial writes. The specification only
> > > > > talks about fully frozen/immutable objects, you're in
> implementation
> > > > > defined territory here. For example, the C++ library assumes the
> Array
> > > > > object is immutable; it memoize the null count, and likely more
> > > > > statistics in the future.
> > > > >
> > > > > If you want to use pre-allocated buffers and array, you can use the
> > > > > column validity bitmap for this purpose, e.g. set all null by
> default
> > > > > and flip once the row is written. It suffers from concurrency
> issues
> > > > > (+ invalidation issues as pointed) when dealing with multiple
> columns.
> > > > > You'll have to use a barrier of some kind, e.g. a per-batch global
> > > > > atomic (if append-only), or dedicated column(s) à-la MVCC. But
> then,
> > > > > the reader needs to be aware of this and compute a mask each time
> it
> > > > > needs to query the partial batch.
> > > > >
> > > > > This is a common columnar database problem, see [1] for a recent
> paper
> > > > > on the subject. The usual technique is to store the recent data
> > > > > row-wise, and transform it in column-wise when a threshold is met
> akin
> > > > > to a compaction phase. There was a somewhat related thread [2]
> lately
> > > > > about streaming vs batching. In the end, I think your solution
> will be
> > > > > very application specific.
> > > > >
> > > > > François
> > > > >
> > > > > [1] https://db.in.tum.de/downloads/publications/datablocks.pdf
> > > > > [2]
> > > > >
> > >
> https://lists.apache.org/thread.html/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6@%3Cdev.arrow.apache.org%3E
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Mon, May 6, 2019 at 10:39 AM John Muehlhausen <jgm@jgm.org>
> wrote:
> > > > > >
> > > > > > Wes,
> > > > > >
> > > > > > I’m not afraid of writing my own C++ code to deal with all
of
> this
> > > on the
> > > > > > writer side.  I just need a way to “append” (incrementally
> populate)
> > > e.g.
> > > > > > feather files so that a person using e.g. pyarrow doesn’t
suffer
> some
> > > > > > catastrophic failure... and “on the side” I tell them which
rows
> are
> > > junk
> > > > > > and deal with any concurrency issues that can’t be solved
in the
> > > arena of
> > > > > > atomicity and ordering of ops.  For now I care about basic types
> but
> > > > > > including variable-width strings.
> > > > > >
> > > > > > For event-processing, I think Arrow has to have the concept
of a
> > > > > partially
> > > > > > full record set.  Some alternatives are:
> > > > > > - have a batch size of one, thus littering the landscape with
> > > trivially
> > > > > > small Arrow buffers
> > > > > > - artificially increase latency with a batch size larger than
> one,
> > > but
> > > > > not
> > > > > > processing any data until a batch is complete
> > > > > > - continuously re-write the (entire!) “main” buffer as batches
of
> > > length
> > > > > 1
> > > > > > roll in
> > > > > > - instead of one main buffer, several, and at some threshold
> combine
> > > the
> > > > > > last N length-1 batches into a length N buffer ... still an
> > > inefficiency
> > > > > >
> > > > > > Consider the case of QAbstractTableModel as the underlying data
> for a
> > > > > table
> > > > > > or a chart.  This visualization shows all of the data for the
> recent
> > > past
> > > > > > as well as events rolling in.  If this model interface is
> > > implemented as
> > > > > a
> > > > > > view onto “many thousands” of individual event buffers then
we
> gain
> > > > > nothing
> > > > > > from columnar layout.  (Suppose there are tons of columns and
> most of
> > > > > them
> > > > > > are scrolled out of the view.). Likewise we cannot re-write
the
> > > entire
> > > > > > model on each event... time complexity blows up.  What we want
> is to
> > > > > have a
> > > > > > large pre-allocated chunk and just change rowCount() as data
is
> > > > > “appended.”
> > > > > >  Sure, we may run out of space and have another and another
> chunk for
> > > > > > future row ranges, but a handful of chunks chained together
is
> better
> > > > > than
> > > > > > as many chunks as there were events!
> > > > > >
> > > > > > And again, having a batch size >1 and delaying the data until
a
> > > batch is
> > > > > > full is a non-starter.
> > > > > >
> > > > > > I am really hoping to see partially-filled buffers as something
> we
> > > keep
> > > > > our
> > > > > > finger on moving forward!  Or else, what am I missing?
> > > > > >
> > > > > > -John
> > > > > >
> > > > > > On Mon, May 6, 2019 at 8:24 AM Wes McKinney <wesmckinn@gmail.com
> >
> > > wrote:
> > > > > >
> > > > > > > hi John,
> > > > > > >
> > > > > > > In C++ the builder classes don't yet support writing into
> > > preallocated
> > > > > > > memory. It would be tricky for applications to determine
a
> priori
> > > > > > > which segments of memory to pass to the builder. It seems
only
> > > > > > > feasible for primitive / fixed-size types so my guess would
be
> > > that a
> > > > > > > separate set of interfaces would need to be developed for
this
> > > task.
> > > > > > >
> > > > > > > - Wes
> > > > > > >
> > > > > > > On Mon, May 6, 2019 at 8:18 AM Jacques Nadeau <
> jacques@apache.org>
> > > > > wrote:
> > > > > > > >
> > > > > > > > This is more of a question of implementation versus
> > > specification. An
> > > > > > > arrow
> > > > > > > > buffer is generally built and then sealed. In different
> > > languages,
> > > > > this
> > > > > > > > building process works differently (a concern of the
language
> > > rather
> > > > > than
> > > > > > > > the memory specification). We don't currently allow
a half
> built
> > > > > vector
> > > > > > > to
> > > > > > > > be moved to another language and then be further built.
So
> the
> > > > > question
> > > > > > > is
> > > > > > > > really more concrete: what language are you looking
at and
> what
> > > is
> > > > > the
> > > > > > > > specific pattern you're trying to undertake for building.
> > > > > > > >
> > > > > > > > If you're trying to go across independent processes
(whether
> the
> > > same
> > > > > > > > process restarted or two separate processes active
> > > simultaneously)
> > > > > you'll
> > > > > > > > need to build up your own data structures to help
with this.
> > > > > > > >
> > > > > > > > On Mon, May 6, 2019 at 6:28 PM John Muehlhausen <jgm@jgm.org
> >
> > > wrote:
> > > > > > > >
> > > > > > > > > Hello,
> > > > > > > > >
> > > > > > > > > Glad to learn of this project— good work!
> > > > > > > > >
> > > > > > > > > If I allocate a single chunk of memory and start
building
> Arrow
> > > > > format
> > > > > > > > > within it, does this chunk save any state regarding
my
> > > progress?
> > > > > > > > >
> > > > > > > > > For example, suppose I allocate a column for
floating point
> > > (fixed
> > > > > > > width)
> > > > > > > > > and a column for string (variable width).  Suppose
I start
> > > > > building the
> > > > > > > > > floating point column at offset X into my single
buffer,
> and
> > > the
> > > > > string
> > > > > > > > > “pointer” column at offset Y into the same
single buffer,
> and
> > > the
> > > > > > > string
> > > > > > > > > data elements at offset Z.
> > > > > > > > >
> > > > > > > > > I write one floating point number and one string,
then go
> away.
> > > > > When I
> > > > > > > > > come back to this buffer to append another value,
does the
> > > buffer
> > > > > > > itself
> > > > > > > > > know where I would begin?  I.e. is there a differentiation
> in
> > > the
> > > > > > > column
> > > > > > > > > (or blob) data itself between the available space
and the
> used
> > > > > space?
> > > > > > > > >
> > > > > > > > > Suppose I write a lot of large variable width
strings and
> “run
> > > > > out” of
> > > > > > > > > space for them before running out of space for
floating
> point
> > > > > numbers
> > > > > > > or
> > > > > > > > > string pointers.  (I guessed badly when doing
the original
> > > > > > > allocation.). I
> > > > > > > > > consider this to be Ok since I can always “copy”
the data
> to
> > > > > “compress
> > > > > > > out”
> > > > > > > > > the unused fp/pointer buckets... the choice is
up to me.
> > > > > > > > >
> > > > > > > > > The above applied to a (feather?) file is how
I anticipate
> > > > > appending
> > > > > > > data
> > > > > > > > > to disk... pre-allocate a mem-mapped file and
gradually
> fill
> > > it up.
> > > > > > > The
> > > > > > > > > efficiency of file utilization will depend on
my
> projections
> > > > > regarding
> > > > > > > > > variable-width data types, but as I said above,
I can
> always
> > > > > re-write
> > > > > > > the
> > > > > > > > > file if/when this bothers me.
> > > > > > > > >
> > > > > > > > > Is this the recommended and supported approach
for
> incremental
> > > > > appends?
> > > > > > > > > I’m really hoping to use Arrow instead of rolling
my own,
> but
> > > > > > > functionality
> > > > > > > > > like this is absolutely key!  Hoping not to use
a side-car
> > > file (or
> > > > > > > memory
> > > > > > > > > chunk) to store “append progress” information.
> > > > > > > > >
> > > > > > > > > I am brand new to this project so please forgive
me if I
> have
> > > > > > > overlooked
> > > > > > > > > something obvious.  And again, looks like great
work so
> far!
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > > -John
> > > > > > > > >
> > > > > > >
> > > > >
> > >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message