arrow-dev mailing list archives

From Edmon Begoli <ebeg...@berkeley.edu>
Subject Re: Text data structures-optimized layout in Arrow
Date Sun, 03 Mar 2019 04:29:36 GMT
Hi Micah,

In short, we recognize that storing text in Arrow is possible and easy if
we store it as an array of bytes representing characters.

What we are trying to do is use Arrow as the format/carrier between
high-performance text processing steps, which like to operate on binary data
structures (e.g. tries or DAFSAs).

We have a working draft of an approach in which Arrow is the data structure
carrier, and encoders/decoders define how these structures are laid out in
the Arrow layout.

So, it could be something like:

text.to_arrow(infer=true|dafsa|trie|b-trie) : arrow
    // Writes Arrow as the format for the specified encoding. This could be
    // implicit if we could store the encoding in some kind of manifest.
arrow.to_text(infer=true|dafsa|trie|b-trie) : string
    // Restores text from the Arrow format, and from a specified encoding,
    // same as above.
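
To make the encoder/decoder idea concrete, here is a minimal sketch of how a trie might be flattened into parallel columns and restored again. It is an illustration only, not the actual draft implementation: the function names (trie_to_columns, columns_to_words) and the three-column layout (label, parent, terminal) are assumptions, and plain Python lists stand in for the Arrow arrays that would carry the columns in practice.

```python
from typing import Dict, List


def trie_to_columns(words: List[str]) -> Dict[str, list]:
    """Flatten a trie built from `words` into three parallel columns.

    Hypothetical layout: row i describes trie node i.
      label[i]    - the character on the edge leading into node i
      parent[i]   - the row index of node i's parent (-1 for the root)
      terminal[i] - True if a stored word ends at node i
    """
    labels = [""]        # node 0 is the root and has an empty label
    parents = [-1]       # the root has no parent
    terminal = [False]
    children: List[Dict[str, int]] = [{}]  # build-time index, not emitted

    for word in words:
        node = 0
        for ch in word:
            nxt = children[node].get(ch)
            if nxt is None:
                # Allocate a new row for a node we have not seen before.
                nxt = len(labels)
                labels.append(ch)
                parents.append(node)
                terminal.append(False)
                children.append({})
                children[node][ch] = nxt
            node = nxt
        terminal[node] = True

    return {"label": labels, "parent": parents, "terminal": terminal}


def columns_to_words(cols: Dict[str, list]) -> List[str]:
    """Restore the stored words by walking parent links from terminal nodes."""
    words = []
    for i, is_end in enumerate(cols["terminal"]):
        if not is_end:
            continue
        chars = []
        node = i
        while node > 0:  # stop at the root (row 0)
            chars.append(cols["label"][node])
            node = cols["parent"][node]
        words.append("".join(reversed(chars)))
    return sorted(words)


if __name__ == "__main__":
    cols = trie_to_columns(["car", "card", "care", "cat"])
    print(cols["label"])
    print(columns_to_words(cols))
```

Each of the three lists is a fixed-width or short-string column, so in principle each could map onto a standard Arrow array, with the encoding name kept in schema metadata as the "manifest" mentioned above. A DAFSA would need a different edge representation, since its nodes can have more than one parent.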

Let me know what you think.

Thank you,
Edmon

On Sat, Mar 2, 2019 at 10:50 PM Micah Kornfield <emkornfield@gmail.com>
wrote:

> Hi Edmon,
> This sounds interesting. I'm not aware of any optimized text memory layout
> beyond our standard string layout. Are there more details about the work
> you are doing? It is a little bit hard to tell if this is a good fit for
> Arrow from your description.
>
> Thanks,
> Micah
>
> On Sat, Mar 2, 2019 at 7:39 PM Edmon Begoli <ebegoli@berkeley.edu> wrote:
>
> > Colleagues:
> >
> > A colleague and I are working on optimized structures for memory and disk
> > layout for raw and pre-processed text using specialized data structures,
> > and with a goal of efficient I/O, inter-process transmissions, and
> > media/memory storage of text-oriented data (e.g. clinical narratives,
> > radiology and pathology reports, etc.)
> >
> > Has anyone on the Arrow dev team tackled this problem of efficient text
> > storage yet?
> > (not just plain text, but storing data structures in an arrow format)
> >
> > If not, would you welcome a contribution?
> >
> > Thank you,
> > Edmon
> >
>
