arrow-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wes McKinney <wesmck...@gmail.com>
Subject Re: Stored state of incremental writes to fixed size Arrow buffer?
Date Tue, 07 May 2019 19:27:42 GMT
On Tue, May 7, 2019 at 12:26 PM John Muehlhausen <jgm@jgm.org> wrote:
>
> Wes, are we saying that `pa.ipc.open_file(...).read_pandas()` already reads
> the future Feather format? If not, how will the future format differ?  I
> will work on my access pattern with this format instead of the current
> feather format.  Sorry I was not clear on that earlier.
>

Yes, under the hood those will use the same zero-copy binary protocol
code paths to read the file.

> Micah, thank you!
>
> On Tue, May 7, 2019 at 11:44 AM Micah Kornfield <emkornfield@gmail.com>
> wrote:
>
> > Hi John,
> > To give a specific pointer [1] describes how the streaming protocol is
> > stored to a file.
> >
> > [1] https://arrow.apache.org/docs/format/IPC.html#file-format
> >
> > On Tue, May 7, 2019 at 9:40 AM Wes McKinney <wesmckinn@gmail.com> wrote:
> >
> > > hi John,
> > >
> > > As soon as the R folks can install the Arrow R package consistently,
> > > the intent is to replace the Feather internals with the plain Arrow
> > > IPC protocol where we have much better platform support all around.
> > >
> > > If you'd like to experiment with creating an API for pre-allocating
> > > fixed-size Arrow protocol blocks and then mutating the data and
> > > metadata on disk in-place, please be our guest. We don't have the
> > > tools developed yet to do this for you
> > >
> > > - Wes
> > >
> > > On Tue, May 7, 2019 at 11:25 AM John Muehlhausen <jgm@jgm.org> wrote:
> > > >
> > > > Thanks Wes:
> > > >
> > > > "the current Feather format is deprecated" ... yes, but there will be
a
> > > > future file format that replaces it, correct?  And my discussion of
> > > > immutable "portions" of Arrow buffers, rather than immutability of the
> > > > entire buffer, applies to IPC as well, right?  I am only championing
> > the
> > > > idea that this future file format have the convenience that certain
> > > > preallocated rows can be ignored based on a metadata setting.
> > > >
> > > > "I recommend batching your writes" ... this is extremely inefficient
> > and
> > > > adds unacceptable latency, relative to the proposed solution.  Do you
> > > > disagree?  Either I have a batch length of 1 to minimize latency, which
> > > > eliminates columnar advantages on the read side, or else I add latency.
> > > > Neither works, and it seems that a viable alternative is within sight?
> > > >
> > > > If you don't agree that there is a core issue and opportunity here, I'm
> > > not
> > > > sure how to better make my case....
> > > >
> > > > -John
> > > >
> > > > On Tue, May 7, 2019 at 11:02 AM Wes McKinney <wesmckinn@gmail.com>
> > > wrote:
> > > >
> > > > > hi John,
> > > > >
> > > > > On Tue, May 7, 2019 at 10:53 AM John Muehlhausen <jgm@jgm.org>
> > wrote:
> > > > > >
> > > > > > Wes et al, I completed a preliminary study of populating a Feather
> > > file
> > > > > > incrementally.  Some notes and questions:
> > > > > >
> > > > > > I wrote the following dataframe to a feather file:
> > > > > >             a    b
> > > > > > 0  0123456789  0.0
> > > > > > 1  0123456789  NaN
> > > > > > 2  0123456789  NaN
> > > > > > 3  0123456789  NaN
> > > > > > 4        None  NaN
> > > > > >
> > > > > > In re-writing the flatbuffers metadata (flatc -p doesn't
> > > > > > support --gen-mutable! yuck! C++ to the rescue...), it seems
that
> > > > > > read_feather is not affected by NumRows?  It seems to be driven
> > > entirely
> > > > > by
> > > > > > the per-column Length values?
> > > > > >
> > > > > > Also, it seems as if one of the primary usages of NullCount
is to
> > > > > determine
> > > > > > whether or not a bitfield is present?  In the initialization
data
> > > above I
> > > > > > was careful to have a null value in each column in order to
> > generate
> > > a
> > > > > > bitfield.
> > > > >
> > > > > Per my prior e-mails, the current Feather format is deprecated, so
> > I'm
> > > > > only willing to engage on a discussion of the official Arrow binary
> > > > > protocol that we use for IPC (memory mapping) and RPC (Flight).
> > > > >
> > > > > >
> > > > > > I then wiped the bitfields in the file and set all of the string
> > > indices
> > > > > to
> > > > > > one past the end of the blob buffer (all strings empty):
> > > > > >       a   b
> > > > > > 0  None NaN
> > > > > > 1  None NaN
> > > > > > 2  None NaN
> > > > > > 3  None NaN
> > > > > > 4  None NaN
> > > > > >
> > > > > > I then set the first record to some data by consuming some of
the
> > > string
> > > > > > blob and row 0 and 1 indices, also setting the double:
> > > > > >
> > > > > >                a    b
> > > > > > 0  Hello, world!  5.0
> > > > > > 1           None  NaN
> > > > > > 2           None  NaN
> > > > > > 3           None  NaN
> > > > > > 4           None  NaN
> > > > > >
> > > > > > As mentioned above, NumRows seems to be ignored.  I tried adjusting
> > > each
> > > > > > column Length to mask off higher rows and it worked for 4 (hide
> > last
> > > row)
> > > > > > but not for less than 4.  I take this to be due to math used
to
> > > figure
> > > > > out
> > > > > > where the buffers are relative to one another since there is
only
> > one
> > > > > > metadata offset for all of: the (optional) bitset, index column
and
> > > (if
> > > > > > string) blobs.
> > > > > >
> > > > > > Populating subsequent rows would proceed in a similar way until
all
> > > of
> > > > > the
> > > > > > blob storage has been consumed, which may come before the
> > > pre-allocated
> > > > > > rows have been consumed.
> > > > > >
> > > > > > So what does this mean for my desire to incrementally write
these
> > > > > > (potentially memory-mapped) pre-allocated files and/or Arrow
> > buffers
> > > in
> > > > > > memory?  Some thoughts:
> > > > > >
> > > > > > - If a single value (such as NumRows) were consulted to determine
> > the
> > > > > table
> > > > > > row dimension then updating this single value would serve to
tell a
> > > > > reader
> > > > > > which rows are relevant.  Assuming this value is updated after
all
> > > other
> > > > > > mutations are complete, and assuming that mutations are only
> > appends
> > > > > > (addition of rows), then concurrency control involves only ensuring
> > > that
> > > > > > this value is not examined while it is being written.
> > > > > >
> > > > > > - NullCount presents a concurrency problem if someone reads
the
> > file
> > > > > after
> > > > > > this field has been updated, but before NumRows has exposed
the new
> > > > > record
> > > > > > (or vice versa).  The idea previously mentioned that there will
> > > "likely
> > > > > > [be] more statistics in the future" feels like it might be scope
> > > creep to
> > > > > > me?  This is a data representation, not a calculation framework?
> > If
> > > > > > NullCount had its genesis in the optional nature of the bitfield,
I
> > > would
> > > > > > suggest that perhaps NullCount can be dropped in favor of always
> > > > > supplying
> > > > > > the bitfield, which in any event is already contemplated by
the
> > spec:
> > > > > > "Implementations may choose to always allocate one anyway as
a
> > > matter of
> > > > > > convenience."  If the concern is space savings, Arrow is already
an
> > > > > > extremely uncompressed format.  (Compression is something I
would
> > > also
> > > > > > consider to be scope creep as regards Feather... compressed
> > > filesystems
> > > > > can
> > > > > > be employed and there are other compressed dataframe formats.)
> > > However,
> > > > > if
> > > > > > protecting the 4 bytes required to update NowRows turns out
to be
> > no
> > > > > easier
> > > > > > than protecting all of the statistical bytes as well as part
of the
> > > same
> > > > > > "critical section" (locks: yuck!!) then statistics pose no issue.
> > I
> > > > > have a
> > > > > > feeling that the availability of an atomic write of 4 bytes
will
> > > depend
> > > > > on
> > > > > > the storage mechanism... memory vs memory map vs write() etc.
> > > > > >
> > > > > > - The elephant in the room appears to be the presumptive binary
> > > yes/no on
> > > > > > mutability of Arrow buffers.  Perhaps the thought is that certain
> > > batch
> > > > > > processes will be wrecked if anyone anywhere is mutating buffers
in
> > > any
> > > > > > way.  But keep in mind I am not proposing general mutability,
only
> > > > > > appending of new data.  *A huge amount of batch processing that
> > will
> > > take
> > > > > > place with Arrow is on time-series data (whether financial or
> > > otherwise).
> > > > > > It is only natural that architects will want the minimal impedance
> > > > > mismatch
> > > > > > when it comes time to grow those time series as the events occur
> > > going
> > > > > > forward.*  So rather than say that I want "mutable" Arrow buffers,
> > I
> > > > > would
> > > > > > pitch this as a call for "immutable populated areas" of Arrow
> > buffers
> > > > > > combined with the concept that the populated area can grow up
to
> > > whatever
> > > > > > was preallocated.  This will not affect anyone who has "memoized"
a
> > > > > > dimension and wants to continue to consider the then-current
data
> > as
> > > > > > immutable... it will be immutable and will always be immutable
> > > according
> > > > > to
> > > > > > that then-current dimension.
> > > > > >
> > > > > > Thanks in advance for considering this feedback!  I absolutely
> > > require
> > > > > > efficient row-wise growth of an Arrow-like buffer to deal with
time
> > > > > series
> > > > > > data in near real time.  I believe that preallocation is (by
far)
> > the
> > > > > most
> > > > > > efficient way to accomplish this.  I hope to be able to use
Arrow!
> > > If I
> > > > > > cannot use Arrow than I will be using a home-grown Arrow that
is
> > > > > identical
> > > > > > except for this feature, which would be very sad!  Even if Arrow
> > > itself
> > > > > > could be used in this manner today, I would be hesitant to use
it
> > if
> > > the
> > > > > > use-case was not protected on a go-forward basis.
> > > > > >
> > > > >
> > > > > I recommend batching your writes and using the Arrow binary streaming
> > > > > protocol so you are only appending to a file rather than mutating
> > > > > previously-written bytes. This use case is well defined and supported
> > > > > in the software already.
> > > > >
> > > > >
> > > > >
> > >
> > https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#streaming-format
> > > > >
> > > > > - Wes
> > > > >
> > > > > > Of course, I am completely open to alternative ideas and
> > approaches!
> > > > > >
> > > > > > -John
> > > > > >
> > > > > > On Mon, May 6, 2019 at 11:39 AM Wes McKinney <wesmckinn@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > hi John -- again, I would caution you against using Feather
files
> > > for
> > > > > > > issues of longevity -- the internal memory layout of those
files
> > > is a
> > > > > > > "dead man walking" so to speak.
> > > > > > >
> > > > > > > I would advise against forking the project, IMHO that is
a dark
> > > path
> > > > > > > that leads nowhere good. We have a large community here
and we
> > > accept
> > > > > > > pull requests -- I think the challenge is going to be defining
> > the
> > > use
> > > > > > > case to suitable clarity that a general purpose solution
can be
> > > > > > > developed.
> > > > > > >
> > > > > > > - Wes
> > > > > > >
> > > > > > >
> > > > > > > On Mon, May 6, 2019 at 11:16 AM John Muehlhausen <jgm@jgm.org>
> > > wrote:
> > > > > > > >
> > > > > > > > François, Wes,
> > > > > > > >
> > > > > > > > Thanks for the feedback.  I think the most practical
thing for
> > > me to
> > > > > do
> > > > > > > is
> > > > > > > > 1- write a Feather file that is structured to pre-allocate
the
> > > space
> > > > > I
> > > > > > > need
> > > > > > > > (e.g. initial variable-length strings are of average
size)
> > > > > > > > 2- come up with code to monkey around with the values
contained
> > > in
> > > > > the
> > > > > > > > vectors so that before and after each manipulation
the file is
> > > valid
> > > > > as I
> > > > > > > > walk the rows ... this is a writer that uses memory
mapping
> > > > > > > > 3- check back in here once that works, assuming that
it does,
> > to
> > > see
> > > > > if
> > > > > > > we
> > > > > > > > can bless certain mutation paths
> > > > > > > > 4- if we can't bless certain mutation paths, fork
the project
> > for
> > > > > those
> > > > > > > who
> > > > > > > > care more about stream processing?  That would not
seem to be
> > > ideal
> > > > > as I
> > > > > > > > think mutation in row-order across the data set is
relatively
> > low
> > > > > impact
> > > > > > > on
> > > > > > > > the overall design?
> > > > > > > >
> > > > > > > > Thanks again for engaging the topic!
> > > > > > > > -John
> > > > > > > >
> > > > > > > > On Mon, May 6, 2019 at 10:26 AM Francois Saint-Jacques
<
> > > > > > > > fsaintjacques@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hello John,
> > > > > > > > >
> > > > > > > > > Arrow is not yet suited for partial writes. The
specification
> > > only
> > > > > > > > > talks about fully frozen/immutable objects, you're
in
> > > > > implementation
> > > > > > > > > defined territory here. For example, the C++
library assumes
> > > the
> > > > > Array
> > > > > > > > > object is immutable; it memoize the null count,
and likely
> > more
> > > > > > > > > statistics in the future.
> > > > > > > > >
> > > > > > > > > If you want to use pre-allocated buffers and
array, you can
> > > use the
> > > > > > > > > column validity bitmap for this purpose, e.g.
set all null by
> > > > > default
> > > > > > > > > and flip once the row is written. It suffers
from concurrency
> > > > > issues
> > > > > > > > > (+ invalidation issues as pointed) when dealing
with multiple
> > > > > columns.
> > > > > > > > > You'll have to use a barrier of some kind, e.g.
a per-batch
> > > global
> > > > > > > > > atomic (if append-only), or dedicated column(s)
à-la MVCC.
> > But
> > > > > then,
> > > > > > > > > the reader needs to be aware of this and compute
a mask each
> > > time
> > > > > it
> > > > > > > > > needs to query the partial batch.
> > > > > > > > >
> > > > > > > > > This is a common columnar database problem, see
[1] for a
> > > recent
> > > > > paper
> > > > > > > > > on the subject. The usual technique is to store
the recent
> > data
> > > > > > > > > row-wise, and transform it in column-wise when
a threshold is
> > > met
> > > > > akin
> > > > > > > > > to a compaction phase. There was a somewhat related
thread
> > [2]
> > > > > lately
> > > > > > > > > about streaming vs batching. In the end, I think
your
> > solution
> > > > > will be
> > > > > > > > > very application specific.
> > > > > > > > >
> > > > > > > > > François
> > > > > > > > >
> > > > > > > > > [1]
> > https://db.in.tum.de/downloads/publications/datablocks.pdf
> > > > > > > > > [2]
> > > > > > > > >
> > > > > > >
> > > > >
> > >
> > https://lists.apache.org/thread.html/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6@%3Cdev.arrow.apache.org%3E
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Mon, May 6, 2019 at 10:39 AM John Muehlhausen
<
> > jgm@jgm.org>
> > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Wes,
> > > > > > > > > >
> > > > > > > > > > I’m not afraid of writing my own C++ code
to deal with all
> > of
> > > > > this
> > > > > > > on the
> > > > > > > > > > writer side.  I just need a way to “append”
(incrementally
> > > > > populate)
> > > > > > > e.g.
> > > > > > > > > > feather files so that a person using e.g.
pyarrow doesn’t
> > > suffer
> > > > > some
> > > > > > > > > > catastrophic failure... and “on the side”
I tell them which
> > > rows
> > > > > are
> > > > > > > junk
> > > > > > > > > > and deal with any concurrency issues that
can’t be solved
> > in
> > > the
> > > > > > > arena of
> > > > > > > > > > atomicity and ordering of ops.  For now
I care about basic
> > > types
> > > > > but
> > > > > > > > > > including variable-width strings.
> > > > > > > > > >
> > > > > > > > > > For event-processing, I think Arrow has
to have the concept
> > > of a
> > > > > > > > > partially
> > > > > > > > > > full record set.  Some alternatives are:
> > > > > > > > > > - have a batch size of one, thus littering
the landscape
> > with
> > > > > > > trivially
> > > > > > > > > > small Arrow buffers
> > > > > > > > > > - artificially increase latency with a batch
size larger
> > than
> > > > > one,
> > > > > > > but
> > > > > > > > > not
> > > > > > > > > > processing any data until a batch is complete
> > > > > > > > > > - continuously re-write the (entire!) “main”
buffer as
> > > batches of
> > > > > > > length
> > > > > > > > > 1
> > > > > > > > > > roll in
> > > > > > > > > > - instead of one main buffer, several, and
at some
> > threshold
> > > > > combine
> > > > > > > the
> > > > > > > > > > last N length-1 batches into a length N
buffer ... still an
> > > > > > > inefficiency
> > > > > > > > > >
> > > > > > > > > > Consider the case of QAbstractTableModel
as the underlying
> > > data
> > > > > for a
> > > > > > > > > table
> > > > > > > > > > or a chart.  This visualization shows all
of the data for
> > the
> > > > > recent
> > > > > > > past
> > > > > > > > > > as well as events rolling in.  If this model
interface is
> > > > > > > implemented as
> > > > > > > > > a
> > > > > > > > > > view onto “many thousands” of individual
event buffers then
> > > we
> > > > > gain
> > > > > > > > > nothing
> > > > > > > > > > from columnar layout.  (Suppose there are
tons of columns
> > and
> > > > > most of
> > > > > > > > > them
> > > > > > > > > > are scrolled out of the view.). Likewise
we cannot re-write
> > > the
> > > > > > > entire
> > > > > > > > > > model on each event... time complexity blows
up.  What we
> > > want
> > > > > is to
> > > > > > > > > have a
> > > > > > > > > > large pre-allocated chunk and just change
rowCount() as
> > data
> > > is
> > > > > > > > > “appended.”
> > > > > > > > > >  Sure, we may run out of space and have
another and another
> > > > > chunk for
> > > > > > > > > > future row ranges, but a handful of chunks
chained together
> > > is
> > > > > better
> > > > > > > > > than
> > > > > > > > > > as many chunks as there were events!
> > > > > > > > > >
> > > > > > > > > > And again, having a batch size >1 and
delaying the data
> > > until a
> > > > > > > batch is
> > > > > > > > > > full is a non-starter.
> > > > > > > > > >
> > > > > > > > > > I am really hoping to see partially-filled
buffers as
> > > something
> > > > > we
> > > > > > > keep
> > > > > > > > > our
> > > > > > > > > > finger on moving forward!  Or else, what
am I missing?
> > > > > > > > > >
> > > > > > > > > > -John
> > > > > > > > > >
> > > > > > > > > > On Mon, May 6, 2019 at 8:24 AM Wes McKinney
<
> > > wesmckinn@gmail.com
> > > > > >
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > hi John,
> > > > > > > > > > >
> > > > > > > > > > > In C++ the builder classes don't yet
support writing into
> > > > > > > preallocated
> > > > > > > > > > > memory. It would be tricky for applications
to determine
> > a
> > > > > priori
> > > > > > > > > > > which segments of memory to pass to
the builder. It seems
> > > only
> > > > > > > > > > > feasible for primitive / fixed-size
types so my guess
> > > would be
> > > > > > > that a
> > > > > > > > > > > separate set of interfaces would need
to be developed for
> > > this
> > > > > > > task.
> > > > > > > > > > >
> > > > > > > > > > > - Wes
> > > > > > > > > > >
> > > > > > > > > > > On Mon, May 6, 2019 at 8:18 AM Jacques
Nadeau <
> > > > > jacques@apache.org>
> > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > This is more of a question of
implementation versus
> > > > > > > specification. An
> > > > > > > > > > > arrow
> > > > > > > > > > > > buffer is generally built and
then sealed. In different
> > > > > > > languages,
> > > > > > > > > this
> > > > > > > > > > > > building process works differently
(a concern of the
> > > language
> > > > > > > rather
> > > > > > > > > than
> > > > > > > > > > > > the memory specification). We
don't currently allow a
> > > half
> > > > > built
> > > > > > > > > vector
> > > > > > > > > > > to
> > > > > > > > > > > > be moved to another language and
then be further built.
> > > So
> > > > > the
> > > > > > > > > question
> > > > > > > > > > > is
> > > > > > > > > > > > really more concrete: what language
are you looking at
> > > and
> > > > > what
> > > > > > > is
> > > > > > > > > the
> > > > > > > > > > > > specific pattern you're trying
to undertake for
> > building.
> > > > > > > > > > > >
> > > > > > > > > > > > If you're trying to go across
independent processes
> > > (whether
> > > > > the
> > > > > > > same
> > > > > > > > > > > > process restarted or two separate
processes active
> > > > > > > simultaneously)
> > > > > > > > > you'll
> > > > > > > > > > > > need to build up your own data
structures to help with
> > > this.
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, May 6, 2019 at 6:28 PM
John Muehlhausen <
> > > jgm@jgm.org
> > > > > >
> > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hello,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Glad to learn of this project—
good work!
> > > > > > > > > > > > >
> > > > > > > > > > > > > If I allocate a single chunk
of memory and start
> > > building
> > > > > Arrow
> > > > > > > > > format
> > > > > > > > > > > > > within it, does this chunk
save any state regarding
> > my
> > > > > > > progress?
> > > > > > > > > > > > >
> > > > > > > > > > > > > For example, suppose I allocate
a column for floating
> > > point
> > > > > > > (fixed
> > > > > > > > > > > width)
> > > > > > > > > > > > > and a column for string (variable
width).  Suppose I
> > > start
> > > > > > > > > building the
> > > > > > > > > > > > > floating point column at
offset X into my single
> > > buffer,
> > > > > and
> > > > > > > the
> > > > > > > > > string
> > > > > > > > > > > > > “pointer” column at offset
Y into the same single
> > > buffer,
> > > > > and
> > > > > > > the
> > > > > > > > > > > string
> > > > > > > > > > > > > data elements at offset Z.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I write one floating point
number and one string,
> > then
> > > go
> > > > > away.
> > > > > > > > > When I
> > > > > > > > > > > > > come back to this buffer
to append another value,
> > does
> > > the
> > > > > > > buffer
> > > > > > > > > > > itself
> > > > > > > > > > > > > know where I would begin?
 I.e. is there a
> > > differentiation
> > > > > in
> > > > > > > the
> > > > > > > > > > > column
> > > > > > > > > > > > > (or blob) data itself between
the available space and
> > > the
> > > > > used
> > > > > > > > > space?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Suppose I write a lot of
large variable width strings
> > > and
> > > > > “run
> > > > > > > > > out” of
> > > > > > > > > > > > > space for them before running
out of space for
> > floating
> > > > > point
> > > > > > > > > numbers
> > > > > > > > > > > or
> > > > > > > > > > > > > string pointers.  (I guessed
badly when doing the
> > > original
> > > > > > > > > > > allocation.). I
> > > > > > > > > > > > > consider this to be Ok since
I can always “copy” the
> > > data
> > > > > to
> > > > > > > > > “compress
> > > > > > > > > > > out”
> > > > > > > > > > > > > the unused fp/pointer buckets...
the choice is up to
> > > me.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The above applied to a (feather?)
file is how I
> > > anticipate
> > > > > > > > > appending
> > > > > > > > > > > data
> > > > > > > > > > > > > to disk... pre-allocate a
mem-mapped file and
> > gradually
> > > > > fill
> > > > > > > it up.
> > > > > > > > > > > The
> > > > > > > > > > > > > efficiency of file utilization
will depend on my
> > > > > projections
> > > > > > > > > regarding
> > > > > > > > > > > > > variable-width data types,
but as I said above, I can
> > > > > always
> > > > > > > > > re-write
> > > > > > > > > > > the
> > > > > > > > > > > > > file if/when this bothers
me.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Is this the recommended and
supported approach for
> > > > > incremental
> > > > > > > > > appends?
> > > > > > > > > > > > > I’m really hoping to use
Arrow instead of rolling my
> > > own,
> > > > > but
> > > > > > > > > > > functionality
> > > > > > > > > > > > > like this is absolutely key!
 Hoping not to use a
> > > side-car
> > > > > > > file (or
> > > > > > > > > > > memory
> > > > > > > > > > > > > chunk) to store “append
progress” information.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I am brand new to this project
so please forgive me
> > if
> > > I
> > > > > have
> > > > > > > > > > > overlooked
> > > > > > > > > > > > > something obvious.  And again,
looks like great work
> > so
> > > > > far!
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks!
> > > > > > > > > > > > > -John
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > >
> >

Mime
View raw message