arrow-dev mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: Stored state of incremental writes to fixed size Arrow buffer?
Date Mon, 06 May 2019 13:23:57 GMT
Hi John,

In C++ the builder classes don't yet support writing into preallocated
memory. It would be tricky for applications to determine a priori
which segments of memory to pass to the builder. It seems only
feasible for primitive / fixed-size types so my guess would be that a
separate set of interfaces would need to be developed for this task.

- Wes

On Mon, May 6, 2019 at 8:18 AM Jacques Nadeau <jacques@apache.org> wrote:
>
> This is more of a question of implementation versus specification. An arrow
> buffer is generally built and then sealed. In different languages, this
> building process works differently (a concern of the language rather than
> the memory specification). We don't currently allow a half built vector to
> be moved to another language and then be further built. So the question is
> really more concrete: what language are you looking at and what is the
> specific pattern you're trying to undertake for building?
>
> If you're trying to go across independent processes (whether the same
> process restarted or two separate processes active simultaneously) you'll
> need to build up your own data structures to help with this.
>
> On Mon, May 6, 2019 at 6:28 PM John Muehlhausen <jgm@jgm.org> wrote:
>
> > Hello,
> >
> > Glad to learn of this project. Good work!
> >
> > If I allocate a single chunk of memory and start building Arrow format
> > within it, does this chunk save any state regarding my progress?
> >
> > For example, suppose I allocate a column for floating point (fixed width)
> > and a column for string (variable width).  Suppose I start building the
> > floating point column at offset X into my single buffer, and the string
> > “pointer” column at offset Y into the same single buffer, and the string
> > data elements at offset Z.
> >
> > I write one floating point number and one string, then go away.  When I
> > come back to this buffer to append another value, does the buffer itself
> > know where I would begin?  I.e. is there a differentiation in the column
> > (or blob) data itself between the available space and the used space?
> >
> > Suppose I write a lot of large variable width strings and “run out” of
> > space for them before running out of space for floating point numbers or
> > string pointers.  (I guessed badly when doing the original allocation.)  I
> > consider this to be Ok since I can always “copy” the data to “compress out”
> > the unused fp/pointer buckets... the choice is up to me.
> >
> > The above applied to a (Feather?) file is how I anticipate appending data
> > to disk... pre-allocate a memory-mapped file and gradually fill it up.  The
> > efficiency of file utilization will depend on my projections regarding
> > variable-width data types, but as I said above, I can always re-write the
> > file if/when this bothers me.
> >
> > Is this the recommended and supported approach for incremental appends?
> > I’m really hoping to use Arrow instead of rolling my own, but functionality
> > like this is absolutely key!  Hoping not to use a side-car file (or memory
> > chunk) to store “append progress” information.
> >
> > I am brand new to this project so please forgive me if I have overlooked
> > something obvious.  And again, looks like great work so far!
> >
> > Thanks!
> > -John
> >
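
To make the question and the replies above concrete, here is a minimal sketch in plain Python. It is not Arrow's API. The class FixedRegionColumns, its fields, and the sizes used are hypothetical names chosen for illustration. It lays out a float64 column at offset X, an int32 offsets ("string pointer") column at offset Y, and string bytes at offset Z inside one preallocated buffer, and it assumes no validity bitmaps and no padding. The point it tries to show is the one made in the replies: the row count is not recorded in the data buffers, so the append progress has to be kept in a structure of your own.

```python
import struct


class FixedRegionColumns:
    """Hypothetical helper (not part of Arrow): one float64 column and one
    string column laid out in a single preallocated buffer."""

    def __init__(self, capacity_rows, string_bytes):
        # Offsets X, Y, Z as in the question. Validity bitmaps and Arrow's
        # recommended padding/alignment are left out to keep the sketch short.
        self.x = 0                                  # float64 values
        self.y = self.x + 8 * capacity_rows         # int32 offsets (rows + 1 entries)
        self.z = self.y + 4 * (capacity_rows + 1)   # UTF-8 string bytes
        self.buf = bytearray(self.z + string_bytes)
        self.capacity_rows = capacity_rows
        self.string_bytes = string_bytes
        # Append progress. This lives outside the buffer; nothing in the
        # buffer contents records it.
        self.length = 0

    def append(self, value, text):
        data = text.encode("utf-8")
        # The current end of the string data is the offset entry at index `length`.
        start = struct.unpack_from("<i", self.buf, self.y + 4 * self.length)[0]
        end = start + len(data)
        if self.length >= self.capacity_rows or end > self.string_bytes:
            raise MemoryError("preallocated region is full")
        struct.pack_into("<d", self.buf, self.x + 8 * self.length, value)
        self.buf[self.z + start:self.z + end] = data
        struct.pack_into("<i", self.buf, self.y + 4 * (self.length + 1), end)
        self.length += 1


# A fresh region and a region holding one row (0.0, "") contain the same bytes,
# so the bytes alone cannot say where the next append should go.
empty = FixedRegionColumns(capacity_rows=4, string_bytes=16)
one_row = FixedRegionColumns(capacity_rows=4, string_bytes=16)
one_row.append(0.0, "")
same_bytes = empty.buf == one_row.buf
```

In this sketch a writer that goes away and comes back would need to persist `length` somewhere, for example in a small header it defines itself. That is the kind of application-level structure the reply above refers to. Running out of string space before running out of row slots surfaces here as a MemoryError, at which point the data could be copied into a better-sized region.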
