arrow-dev mailing list archives

From Edmon Begoli <ebeg...@berkeley.edu>
Subject Re: Text data structures-optimized layout in Arrow
Date Mon, 04 Mar 2019 01:32:21 GMT
Thanks, Wes.

_contrib_ could indeed be a good option for this.

Unless the community objects, I suggest that I create a JIRA issue for this.
We could use that issue to track the work and to document the intended
purpose and design thinking in as much detail as possible.

My team and I fully intend to implement this functionality within the next
six months, so it would indeed be good to stay coordinated and to integrate
it into the Arrow code base in some unobtrusive way.

Thank you,
Edmon




On Sun, Mar 3, 2019 at 7:57 PM Wes McKinney <wesmckinn@gmail.com> wrote:

> hi Edmon,
>
> Since we've just added a C++ API for "extension types" this might be a
> place to try these out to define custom container types for text:
>
>
> https://github.com/apache/arrow/commit/a79cc809883192417920b501e41a0e8b63cd0ad1
>
> I don't have a sense of where such code should go in the project and
> how many users it might have. It seems from my perspective better to
> build something inside the Arrow community from the outset rather than
> deal with a code donation at some point later in time.
>
> It seems we might want to create a "contrib" directory (either
> cpp/src/arrow/contrib or cpp/contrib) for new things where we aren't
> sure what is to become of the code.
>
> - Wes
>
> On Sat, Mar 2, 2019 at 10:33 PM Edmon Begoli <ebegoli@berkeley.edu> wrote:
> >
> > Hi Micah,
> >
> > In short, we recognize that storing text in Arrow is possible and easy if
> > we store it as an array of bytes representing characters.
> >
> > What we are trying to do is use Arrow as the format/carrier between
> > high-performance text processing steps that prefer to operate on binary
> > data structures (e.g. tries or DAFSAs).
> >
> > We have a working/draft approach where we would use Arrow as the data
> > structure carrier, and we would use encoders/decoders to define how these
> > structures are laid out in the Arrow layout.
> >
> > So, it could be something like:
> >
> > text.to_arrow(infer=true|dafsa|trie|b-trie) : arrow
> >   // writes Arrow as the format for the specified encoding. This could be
> >   // implicit if we could store the encoding in some kind of manifest.
> > arrow.to_text(infer=true|dafsa|trie|b-trie) : string
> >   // restores text from the Arrow format and from a specified encoding,
> >   // same as above.
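
To make the encoder/decoder idea above concrete, here is a minimal sketch of
one way a trie could be flattened into parallel columns and restored. It is
an illustration only, not the design discussed in this thread: the function
names `trie_to_columns` and `columns_to_words` and the three-column layout
are hypothetical, and plain Python lists stand in for Arrow arrays so that no
Arrow API is assumed.

```python
from typing import Dict, List


def trie_to_columns(words: List[str]) -> Dict[str, list]:
    """Encode words as a trie flattened into three parallel columns.

    Hypothetical layout: row i describes trie node i.
    """
    parent = [-1]       # index of the parent node; the root has -1
    label = [""]        # character on the edge leading into this node
    terminal = [False]  # True if some word ends at this node
    children = [{}]     # build-time lookup only, not part of the output

    for word in words:
        node = 0
        for ch in word:
            nxt = children[node].get(ch)
            if nxt is None:
                # First time this prefix is seen: append a new node row.
                nxt = len(parent)
                parent.append(node)
                label.append(ch)
                terminal.append(False)
                children.append({})
                children[node][ch] = nxt
            node = nxt
        terminal[node] = True

    return {"parent": parent, "label": label, "terminal": terminal}


def columns_to_words(cols: Dict[str, list]) -> List[str]:
    """Restore the sorted set of words by walking parent pointers."""
    words = []
    for start, is_end in enumerate(cols["terminal"]):
        if not is_end:
            continue
        chars = []
        cur = start
        while cur != -1:
            chars.append(cols["label"][cur])
            cur = cols["parent"][cur]
        words.append("".join(reversed(chars)))
    return sorted(words)


cols = trie_to_columns(["car", "card", "care", "cat"])
restored = columns_to_words(cols)
```

Shared prefixes are stored once, so the four words above need seven node rows
including the root. Note that this sketch restores the set of distinct words
in sorted order, not the original text sequence; preserving order or
duplicates would need an additional column.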
> >
> > Let me know what you think.
> >
> > Thank you,
> > Edmon
> >
> > On Sat, Mar 2, 2019 at 10:50 PM Micah Kornfield <emkornfield@gmail.com>
> > wrote:
> >
> > > Hi Edmon,
> > > This sounds interesting. I'm not aware of any optimized text memory
> > > layout beyond our standard string layout. Are there more details about
> > > the work you are doing? It is a little bit hard to tell if this is a
> > > good fit for Arrow from your description.
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Sat, Mar 2, 2019 at 7:39 PM Edmon Begoli <ebegoli@berkeley.edu> wrote:
> > >
> > > > Colleagues:
> > > >
> > > > A colleague and I are working on optimized structures for memory and
> > > > disk layout of raw and pre-processed text using specialized data
> > > > structures, with a goal of efficient I/O, inter-process transmission,
> > > > and media/memory storage of text-oriented data (e.g. clinical
> > > > narratives, radiology and pathology reports, etc.)
> > > >
> > > > Has anyone on the Arrow dev team tackled this problem of efficient
> > > > text storage yet?
> > > > (not just plain text, but storing data structures in an Arrow format)
> > > >
> > > > If not, would you welcome a contribution?
> > > >
> > > > Thank you,
> > > > Edmon
> > > >
> > >
>
