drill-dev mailing list archives

From Edmon Begoli <ebeg...@gmail.com>
Subject Re: Update on EDI support for Drill - repo and design collaboratory
Date Mon, 14 Sep 2015 03:42:03 GMT
I understand. I hope you and the rest will help me with design guidance as
I start translating EDI format into a Drill-amenable one.

On Sunday, September 13, 2015, Ted Dunning <ted.dunning@gmail.com> wrote:

> I doubt that I will be able to produce significant amounts of code. If I do
> produce much of anything, I would be happy to contribute via pull requests.
>
> So I don't need to be on the repo as a contributor.
>
> On Sun, Sep 13, 2015 at 1:42 PM, Edmon Begoli <ebegoli@gmail.com> wrote:
>
> > Ted, Matt, et al.,
> >
> > I have created a temporary repository for design and development of the
> > support for EDI format in Drill.
> > At this point, it is not a fork of Drill, but rather a collaboration space
> > and code repository for exploratory code.
> >
> > Wiki:
> > https://github.com/ebegoli/edi-drill-store/wiki
> >
> > Repo:
> > https://github.com/ebegoli/edi-drill-store
> >
> > Once the difficult parts specific to EDI (logical nesting, record
> > representation) are figured out, and generic code written for I/O and
> > translation,
> > I will look to merge this with Drill and blend it into Drill-specific
> > patterns.
> >
> > *If you wish, I will add you to the repo, so you can edit the wiki.*
> >
> > Let me know please.
> >
> > Edmon
> >
> >
> > On Sun, Sep 6, 2015 at 7:16 AM, Edmon Begoli <ebegoli@gmail.com> wrote:
> >
> > > Matt - that is fantastic. Having good, liberally licensed format
> > > converters probably takes care of 50% of the problem. The other 50%
> > > will be in figuring out the logical mapping.
> > >
> > > Let me think a little bit and propose how we can best set up a
> > > collaboration platform. Any suggestions for this are welcome.
> > >
> > > I personally like Google stuff, Hangouts, docs, and Github, of course.
> > >
> > >
> > > On Saturday, September 5, 2015, Matthew Burgess <mattyb149@gmail.com> wrote:
> > >
> > >> Edmon,
> > >>
> > >> All our Data Integration (file-format parsing, e.g.) code is Apache-2.0
> > >> licensed; we have parsers/processors
> > >> <https://github.com/pentaho/pentaho-kettle/tree/master/engine/src/org/pentaho/di/trans/steps>
> > >> for EDI / XML (StAX) / HL7 / YAML, etc. I have a plugin
> > >> <https://github.com/mattyb149/load-text-from-file-plugin> (also
> > >> Apache-2.0) using Tika to extract metadata; this could be refactored as a
> > >> Drill plugin.
> > >>
> > >> The (semi-)structured-to-tabular conversion will be an issue that most
> > >> Drill extenders will have to deal with, although with powerful functions
> > >> like KVGEN() and FLATTEN() it should be less daunting. For graphs
> > >> (highly-structured but non-tabular data sources), I'm also looking into a
> > >> Gremlin <http://tinkerpop.incubator.apache.org/> plugin, which could
> > >> connect Graph Databases with Drill. Again, the problem is representing
> > >> non-tabular data in a SQL environment as you mentioned.
> > >>
> > >> Regards,
> > >> Matt
> > >>
> > >> From:  Edmon Begoli <ebegoli@gmail.com>
> > >> Reply-To:  <dev@drill.apache.org>
> > >> Date:  Saturday, September 5, 2015 at 8:46 PM
> > >> To:  <dev@drill.apache.org>
> > >> Subject:  Re: Data representation and conversation - translating nested
> > >> hierarchies into a tabular/queriable format
> > >>
> > >> Matt - any contribution of your time is welcome! Thank you.
> > >>
> > >> These problems that we are wanting to look into are not easy problems; I
> > >> would not expect quick solutions, but any good idea, contribution of time,
> > >> or code will help us advance the state of the capabilities.
> > >>
> > >> I might create a branch or separate Github repo, so that we just use its
> > >> wiki for documentation and collaboration, and then later for scratch pad
> > >> development.
> > >>
> > >> Regarding existing tools you might have - *do you think you could bring
> > >> this code under the Apache 2 license?*
> > >> Knowing what you told me before, I think that contributing this code would
> > >> help advance the state of Drill's format support tremendously.
> > >>
> > >> I see two major challenges related to what I am proposing:
> > >>
> > >> 1. (greater challenge) How to bring heterogeneously structured data
> > >> logically and semantically into the tabular orientation of a typical SQL
> > >> query processing engine.
> > >> I think that some problems will not be completely implementable, so we'll
> > >> need to either approximate or make some limiting/bounding design choices.
> > >>
> > >> 2. How to support these new formats through the Drill API. This is more of
> > >> just an API study, design and programming effort. Nothing contradictory.
> > >>
> > >> Edmon
> > >>
> > >>
> > >>
> > >>
> > >> On Sat, Sep 5, 2015 at 8:12 PM, Matt Burgess <mattyb149@gmail.com> wrote:
> > >>
> > >> >  Challenge accepted! :) Are we talking about things like XML, Jsonnet,
> > >> >  Yaml, etc.? And/or binary file formats that are (semi-)structured in
> > >> >  nature like XLSX?
> > >> >
> > >> >  If we want to go more unstructured we could look at Apache Tika to at
> > >> >  least pull out metadata on things like image and video files, and I'm
> > >> >  tinkering with the idea of a UDF called topics() for human-generated
> > >> >  text using Apache OpenNLP, the problem being a well-trained model for
> > >> >  the target data.
> > >> >
> > >> >  Edmon, I admire your ambition and would like to help out where/when I
> > >> >  can. Having said that, so far my amount of available time for Drill has
> > >> >  been embarrassingly lower than my amount of interest.
> > >> >
> > >> >  For well-known file formats, I may be able to help with some of our
> > >> >  open-source tools for parsing such files.
> > >> >
> > >> >  Regards,
> > >> >  Matt
> > >> >
> > >> >  Sent from my iPhone
> > >> >
> > >> >>  > On Sep 5, 2015, at 7:44 PM, Edmon Begoli <ebegoli@gmail.com> wrote:
> > >> >>  >
> > >> >>  > Anyone else from the Drill team wholeheartedly invited.
> > >> >>  >
> > >> >>  > Edmon
> > >> >>  >
> > >> >>>  >> On Sat, Sep 5, 2015 at 7:04 PM, Edmon Begoli <ebegoli@gmail.com> wrote:
> > >> >>>  >>
> > >> >>>  >> Let's do it, Ted. I think it would add tremendous value to Drill as a
> > >> >>>  >> solution.
> > >> >>>  >>
> > >> >>>  >> I will start a Google doc and share with you so we can share ideas,
> > >> >>>  >> have Hangouts, design, etc. until we have something solid to put into
> > >> >>>  >> Drill proper.
> > >> >>>  >>
> > >> >>>  >> If you have any other suggestion for the mode of collaboration please
> > >> >>>  >> let me know.
> > >> >>>  >>
> > >> >>>>  >>> On Saturday, September 5, 2015, Ted Dunning <ted.dunning@gmail.com> wrote:
> > >> >>>>  >>>
> > >> >>>>>  >>>> On Sat, Sep 5, 2015 at 8:57 AM, Edmon Begoli <ebegoli@gmail.com> wrote:
> > >> >>>>>  >>>>
> > >> >>>>>  >>>> *My question - has this been handled already in Drill and storage
> > >> >>>>>  >>>> formats?*
> > >> >>>>>  >>>>
> > >> >>>>>  >>>> If so, where?
> > >> >>>>>  >>>>
> > >> >>>>>  >>>> If not, what is your recommendation for handling this?
> > >> >>>>>  >>>>
> > >> >>>>>  >>>> Should it be in an independent library outside of Drill that presents a
> > >> >>>>>  >>>> flattened version (not sure if this is possible), or maybe break the
> > >> >>>>>  >>>> message into tables corresponding to header data, items, footer?
> > >> >>>>  >>>
> > >> >>>>  >>> Drill does handle these kinds of data well, but currently the only file
> > >> >>>>  >>> formats that it can consume for this kind of data are JSON and Parquet.
> > >> >>>>  >>>
> > >> >>>>  >>> It would be great to have more.  I would love to work on this with you.
> > >> >>>  >>
> > >> >
> > >>
> > >>
> > >>
> > >>
> >
>
