arrow-dev mailing list archives

From Wes McKinney <wesmck...@gmail.com>
Subject Re: Text data structures-optimized layout in Arrow
Date Mon, 04 Mar 2019 00:56:28 GMT
hi Edmon,

Since we've just added a C++ API for "extension types", this might be a
good place to try these ideas out by defining custom container types for text:

https://github.com/apache/arrow/commit/a79cc809883192417920b501e41a0e8b63cd0ad1
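
As a rough illustration of what the storage behind such a container type
might look like, here is a minimal sketch that flattens a trie into a few
parallel columns, in the spirit of Arrow's offsets-plus-values layouts. It
is not code from the Arrow project and does not use the extension type API:
plain Python lists stand in for Arrow buffers, and the names
trie_to_columns, columns_to_words, and the column names are hypothetical.

```python
from collections import deque


def trie_to_columns(words):
    """Flatten a trie of `words` into parallel, Arrow-like columns."""
    # Build a nested-dict trie; the key "" marks the end of a word.
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True

    offsets = [0]   # edges of node i are offsets[i] .. offsets[i + 1]
    labels = []     # one character per edge
    targets = []    # child node id per edge
    terminal = []   # True if a word ends at node i

    # Breadth-first numbering keeps each node's edges contiguous.
    queue = deque([root])
    next_id = 1
    while queue:
        node = queue.popleft()
        terminal.append("" in node)
        for ch in sorted(k for k in node if k != ""):
            labels.append(ch)
            targets.append(next_id)
            next_id += 1
            queue.append(node[ch])
        offsets.append(len(labels))

    return {"offsets": offsets, "labels": labels,
            "targets": targets, "terminal": terminal}


def columns_to_words(cols):
    """Rebuild the sorted word list from the flattened columns."""
    words = []
    stack = [(0, "")]  # (node id, prefix spelled so far)
    while stack:
        node, prefix = stack.pop()
        if cols["terminal"][node]:
            words.append(prefix)
        for e in range(cols["offsets"][node], cols["offsets"][node + 1]):
            stack.append((cols["targets"][e], prefix + cols["labels"][e]))
    return sorted(words)


cols = trie_to_columns(["car", "cart", "cat", "dog"])
restored = columns_to_words(cols)
```

Each of the four columns is a flat array of a primitive type, so in
principle each could be carried as an ordinary Arrow array, with the
encoding name (trie, DAFSA, and so on) recorded as metadata on the type.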

I don't have a sense of where such code should go in the project and
how many users it might have. It seems from my perspective better to
build something inside the Arrow community from the outset rather than
deal with a code donation at some point later in time.

It seems we might want to create a "contrib" directory (either
cpp/src/arrow/contrib or cpp/contrib) for new things where we aren't
sure what is to become of the code.

- Wes

On Sat, Mar 2, 2019 at 10:33 PM Edmon Begoli <ebegoli@berkeley.edu> wrote:
>
> Hi Micah,
>
> In short, we recognize that storing text in Arrow is possible and easy if
> we store the text as an array of bytes representing characters.
>
> What we are trying to do is to use Arrow as the format/carrier between
> high-performance text processing steps that operate on binary data
> structures (e.g. tries or DAFSAs).
>
> We have a working/draft approach where we would use arrow as the data
> structure carrier, and we would use encoders/decoders for how these
> structures are laid out into arrow layout.
>
> So, it could be something like:
>
> // Writes Arrow as the format for the specified encoding. This could be
> // implicit if we could store the encoding in some kind of manifest.
> text.to_arrow(infer=true|dafsa|trie|b-trie) : arrow
>
> // Restores text from the Arrow format, and from a specified encoding,
> // same as above.
> arrow.to_text(infer=true|dafsa|trie|b-trie) : string
>
> Let me know what you think.
>
> Thank you,
> Edmon
>
> On Sat, Mar 2, 2019 at 10:50 PM Micah Kornfield <emkornfield@gmail.com>
> wrote:
>
> > Hi Edmon,
> > This sounds interesting. I'm not aware of any optimized text memory layout
> > beyond our standard string layout. Are there more details about the work
> > you are doing?  It is a little bit hard to tell if this is a good fit for
> > Arrow from your description.
> >
> > Thanks,
> > Micah
> >
> > On Sat, Mar 2, 2019 at 7:39 PM Edmon Begoli <ebegoli@berkeley.edu> wrote:
> >
> > > Colleagues:
> > >
> > > A colleague and I are working on optimized structures for memory and disk
> > > layout for raw and pre-processed text using specialized data structures,
> > > and with a goal of efficient I/O, inter-process transmissions, and
> > > media/memory storage of text-oriented data (e.g. clinical narratives,
> > > radiology and pathology reports, etc.)
> > >
> > > Has anyone on the Arrow dev team tackled this problem of efficient text
> > > storage yet?
> > > (not just plain text, but storing data structures in an arrow format)
> > >
> > > If not, would you welcome a contribution?
> > >
> > > Thank you,
> > > Edmon
> > >
> >
