drill-dev mailing list archives

From Edmon Begoli <ebeg...@gmail.com>
Subject Re: Data representation and conversation - translating nested hierarchies into a tabular/queriable format
Date Sun, 06 Sep 2015 00:46:59 GMT
Matt - any contribution of your time is welcome! Thank you.

The problems we want to look into are not easy; I would not expect quick
solutions, but any good idea, contribution of time, or code will help us
advance Drill's capabilities.

I might create a branch or a separate GitHub repo, so that we can use its
wiki for documentation and collaboration now, and the repo itself later for
scratch-pad development.

Regarding existing tools you might have - *do you think you could bring
this code under the Apache 2 license?*
Knowing what you told me before, I think that contributing this code would
help advance the state of Drill's format support tremendously.

I see two major challenges related to what I am proposing:

1. (greater challenge) How to bring heterogeneously structured data
logically and semantically into the tabular orientation of a typical SQL
query processing engine.
I think that some problems will not be completely implementable, so we'll
need to either approximate or make some limiting/bounding design choices.

2. How to support these new formats through the Drill API. This is mostly
an API study, design, and programming effort. Nothing contradictory.
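
To make the first challenge concrete, here is a minimal Python sketch (not
Drill code and not a proposed design) of one way a nested record could be
turned into flat rows. The function name `flatten` and the sample `message`
with header, items, and footer are hypothetical, chosen only for
illustration. Nested maps become dotted column names and each list is
expanded into one row per element. Expanding several lists in one record as
a cross product is exactly the kind of limiting design choice mentioned
above, and other choices are possible.

```python
from itertools import product
from typing import Any, Dict, List


def flatten(record: Dict[str, Any], prefix: str = "") -> List[Dict[str, Any]]:
    """Return flat rows for one nested record (illustrative sketch only)."""
    # Each entry holds the alternative partial rows contributed by one key.
    parts: List[List[Dict[str, Any]]] = []
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested map: recurse, extending the dotted column prefix.
            parts.append(flatten(value, prefix=f"{path}."))
        elif isinstance(value, list):
            rows: List[Dict[str, Any]] = []
            for item in value:
                if isinstance(item, dict):
                    rows.extend(flatten(item, prefix=f"{path}."))
                else:
                    rows.append({path: item})
            # An empty list still yields one row so the parent is not lost.
            parts.append(rows or [{path: None}])
        else:
            parts.append([{path: value}])

    # Combine one alternative from every key into a complete row.
    result: List[Dict[str, Any]] = []
    for combo in product(*parts):
        row: Dict[str, Any] = {}
        for piece in combo:
            row.update(piece)
        result.append(row)
    return result


# Hypothetical message with header data, repeating items, and a footer.
message = {
    "header": {"id": 7, "sender": "acme"},
    "items": [{"sku": "a", "qty": 1}, {"sku": "b", "qty": 2}],
    "footer": {"total": 3},
}
rows = flatten(message)
for row in rows:
    print(row)
```

Each item ends up as its own row, with the header and footer columns
repeated on every row. That is easy to query but duplicates data; the
alternative of separate header, items, and footer tables avoids the
duplication at the cost of joins.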

Edmon




On Sat, Sep 5, 2015 at 8:12 PM, Matt Burgess <mattyb149@gmail.com> wrote:

> Challenge accepted! :) Are we talking about things like XML, Jsonnet,
> YAML, etc.? And/or binary file formats that are (semi-)structured in nature
> like XLSX?
>
> If we want to go more unstructured we could look at Apache Tika to at
> least pull out metadata on things like image and video files, and I'm
> tinkering with the idea of a UDF called topics() for human-generated text
> using Apache OpenNLP, the problem being the need for a well-trained model
> for the target data.
>
> Edmon, I admire your ambition and would like to help out where/when I can.
> Having said that, so far my amount of available time for Drill has been
> embarrassingly lower than my amount of interest.
>
> For well-known file formats, I may be able to help with some of our
> open-source tools for parsing such files.
>
> Regards,
> Matt
>
> Sent from my iPhone
>
> > On Sep 5, 2015, at 7:44 PM, Edmon Begoli <ebegoli@gmail.com> wrote:
> >
> > Anyone else from the Drill team wholeheartedly invited.
> >
> > Edmon
> >
> >> On Sat, Sep 5, 2015 at 7:04 PM, Edmon Begoli <ebegoli@gmail.com> wrote:
> >>
> >> Let's do it, Ted. I think it would add tremendous value to Drill as a
> >> solution.
> >>
> >> I will start a Google doc and share with you so we can share ideas,
> >> have Hangouts, design, etc. until we have something solid to put into
> >> Drill proper.
> >>
> >> If you have any other suggestion for the mode of collaboration please
> >> let me know.
> >>
> >>> On Saturday, September 5, 2015, Ted Dunning <ted.dunning@gmail.com> wrote:
> >>>
> >>>> On Sat, Sep 5, 2015 at 8:57 AM, Edmon Begoli <ebegoli@gmail.com> wrote:
> >>>>
> >>>> *My question - has this been handled already in Drill and storage
> >>>> formats?*
> >>>>
> >>>> If so, where?
> >>>>
> >>>> If not, what is your recommendation for handling this?
> >>>>
> >>>> Should it be in an independent library outside of Drill that presents
> >>>> a flattened version (not sure if this is possible), or maybe break the
> >>>> message into tables corresponding to header data, items, footer?
> >>>
> >>> Drill does handle these kinds of data well, but currently the only file
> >>> formats that it can consume for this kind of data are JSON and Parquet.
> >>>
> >>> It would be great to have more.  I would love to work on this with you.
> >>
>
