drill-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Update on EDI support for Drill - repo and design collaboratory
Date Mon, 14 Sep 2015 06:07:12 GMT
Take a look at the JSON input format plugin.  That can't be cloned outside
of Drill at this point because it involves access to some internals, but it
should provide some guidance about how to read complex objects.



On Sun, Sep 13, 2015 at 8:42 PM, Edmon Begoli <ebegoli@gmail.com> wrote:

> I understand. I hope you and the rest will help me with design guidance as
> I start translating EDI format into a Drill-amenable one.
>
> On Sunday, September 13, 2015, Ted Dunning <ted.dunning@gmail.com> wrote:
>
> > I doubt that I will be able to produce significant amounts of code. If I do
> > produce much of anything, I would be happy to contribute via pull requests.
> >
> > So I don't need to be on the repo as a contributor.
> >
> > On Sun, Sep 13, 2015 at 1:42 PM, Edmon Begoli <ebegoli@gmail.com> wrote:
> >
> > > Ted, Matt, et al.,
> > >
> > > I have created a temporary repository for the design and development of
> > > support for the EDI format in Drill.
> > > At this point, it is not a fork of Drill, but rather a collaboration space
> > > and code repository for exploratory code.
> > >
> > > Wiki:
> > > https://github.com/ebegoli/edi-drill-store/wiki
> > >
> > > Repo:
> > > https://github.com/ebegoli/edi-drill-store
> > >
> > > Once the difficult parts specific to EDI (logical nesting, record
> > > representation) are figured out, and generic code written for I/O and
> > > translation,
> > > I will look to merge this with Drill and blend it into Drill-specific
> > > patterns.
> > >
> > > *If you wish, I will add you to the repo, so you can edit the wiki.*
> > >
> > > Let me know please.
> > >
> > > Edmon
> > >
> > >
> > > On Sun, Sep 6, 2015 at 7:16 AM, Edmon Begoli <ebegoli@gmail.com> wrote:
> > >
> > > > Matt - that is fantastic. Having good, liberally licensed format
> > > > converters probably takes care of 50% of the problem. The other 50%
> > > > will be in figuring out the logical mapping.
> > > >
> > > > Let me think a little bit and propose how we can best set up a
> > > > collaboration platform. Any suggestions are welcome.
> > > >
> > > > I personally like Google stuff, Hangouts, docs, and Github, of course.
> > > >
> > > >
> > > > On Saturday, September 5, 2015, Matthew Burgess <mattyb149@gmail.com>
> > > > wrote:
> > > >
> > > >> Edmon,
> > > >>
> > > > >> All our Data Integration (file-format parsing, e.g.) code is Apache-2.0
> > > > >> licensed; we have parsers/processors
> > > > >> <https://github.com/pentaho/pentaho-kettle/tree/master/engine/src/org/pentaho/di/trans/steps>
> > > > >> for EDI / XML (StAX) / HL7 / YAML, etc. I have a plugin
> > > > >> <https://github.com/mattyb149/load-text-from-file-plugin> (also
> > > > >> Apache-2.0) using Tika to extract metadata; this could be refactored as a
> > > > >> Drill plugin.
> > > >>
> > > > >> The (semi-)structured-to-tabular conversion will be an issue that most
> > > > >> Drill extenders will have to deal with, although with powerful functions
> > > > >> like KVGEN() and FLATTEN() it should be less daunting. For graphs
> > > > >> (highly-structured but non-tabular data sources), I'm also looking into a
> > > > >> Gremlin <http://tinkerpop.incubator.apache.org/> plugin, which could
> > > > >> connect Graph Databases with Drill. Again, the problem is representing
> > > > >> non-tabular data in a SQL environment as you mentioned.
> > > >>
> > > >> Regards,
> > > >> Matt
> > > >>
> > > > >> From:  Edmon Begoli <ebegoli@gmail.com>
> > > > >> Reply-To:  <dev@drill.apache.org>
> > > > >> Date:  Saturday, September 5, 2015 at 8:46 PM
> > > > >> To:  <dev@drill.apache.org>
> > > > >> Subject:  Re: Data representation and conversation - translating nested
> > > > >> hierarchies into a tabular/queriable format
> > > >>
> > > >> Matt - any contribution of your time is welcome! Thank you.
> > > >>
> > > > >> These problems that we are wanting to look into are not easy problems; I
> > > > >> would not expect quick solutions, but any good idea, contribution of time,
> > > > >> or code will help us advance the state of the capabilities.
> > > >>
> > > > >> I might create a branch or separate Github repo, so that we just use its
> > > > >> wiki for documentation and collaboration, and then later for scratch pad
> > > > >> development.
> > > > >>
> > > > >> Regarding existing tools you might have - *do you think you could bring
> > > > >> this code under the Apache 2 license?*
> > > > >> Knowing what you told me before, I think that contributing this code would
> > > > >> help advance the state of Drill's format support tremendously.
> > > >>
> > > >> I see two major challenges related to what I am proposing:
> > > >>
> > > > >> 1. (greater challenge) How to bring heterogeneously structured data
> > > > >> logically and semantically into the tabular orientation of a typical SQL
> > > > >> query processing engine.
> > > > >> I think that some problems will not be completely implementable, so we'll
> > > > >> need to either approximate or make some limiting/bounding design choices.
> > > > >>
> > > > >> 2. How to support these new formats through the Drill API. This is more of
> > > > >> just an API study, design and programming effort. Nothing contradictory.
> > > >>
> > > >> Edmon
> > > >>
> > > >>
> > > >>
> > > >>
> > > > >> On Sat, Sep 5, 2015 at 8:12 PM, Matt Burgess <mattyb149@gmail.com> wrote:
> > > >>
> > > > >> >  Challenge accepted! :) Are we talking about things like XML, Jsonnet,
> > > > >> >  Yaml, etc.? And/or binary file formats that are (semi-)structured in
> > > > >> >  nature like XLSX?
> > > >> >
> > > > >> >  If we want to go more unstructured we could look at Apache Tika to at
> > > > >> >  least pull out metadata on things like image and video files, and I'm
> > > > >> >  tinkering with the idea of a UDF called topics() for human-generated
> > > > >> >  text using Apache OpenNLP, the problem being a well-trained model for
> > > > >> >  the target data.
> > > >> >
> > > > >> >  Edmon, I admire your ambition and would like to help out where/when I
> > > > >> >  can. Having said that, so far my amount of available time for Drill has
> > > > >> >  been embarrassingly lower than my amount of interest.
> > > >> >
> > > > >> >  For well-known file formats, I may be able to help with some of our
> > > > >> >  open-source tools for parsing such files.
> > > >> >
> > > >> >  Regards,
> > > >> >  Matt
> > > >> >
> > > >> >  Sent from my iPhone
> > > >> >
> > > > >> >>  > On Sep 5, 2015, at 7:44 PM, Edmon Begoli <ebegoli@gmail.com> wrote:
> > > >> >>  >
> > > >> >>  > Anyone else from the Drill team wholeheartedly invited.
> > > >> >>  >
> > > >> >>  > Edmon
> > > >> >>  >
> > > > >> >>>  >> On Sat, Sep 5, 2015 at 7:04 PM, Edmon Begoli <ebegoli@gmail.com> wrote:
> > > >> >>>  >>
> > > > >> >>>  >> Let's do it, Ted. I think it would add tremendous value to Drill as a
> > > > >> >>>  >> solution.
> > > >> >>>  >>
> > > > >> >>>  >> I will start a Google doc and share with you so we can share ideas,
> > > > >> >>>  >> have Hangouts, design, etc. until we have something solid to put into
> > > > >> >>>  >> Drill proper.
> > > >> >>>  >>
> > > > >> >>>  >> If you have any other suggestions for the mode of collaboration,
> > > > >> >>>  >> please let me know.
> > > >> >>>  >>
> > > > >> >>>>  >>> On Saturday, September 5, 2015, Ted Dunning <ted.dunning@gmail.com> wrote:
> > > >> >>>>  >>>
> > > > >> >>>>>  >>>> On Sat, Sep 5, 2015 at 8:57 AM, Edmon Begoli <ebegoli@gmail.com> wrote:
> > > >> >>>>>  >>>>
> > > > >> >>>>>  >>>> *My question - has this been handled already in Drill and storage
> > > > >> >>>>>  >>>> formats?*
> > > >> >>>>>  >>>>
> > > >> >>>>>  >>>> If so, where?
> > > >> >>>>>  >>>>
> > > > >> >>>>>  >>>> If not, what is your recommendation for handling this?
> > > >> >>>>>  >>>>
> > > > >> >>>>>  >>>> Should it be in an independent library outside of Drill that presents
> > > > >> >>>>>  >>>> a flattened version (not sure if this is possible), or maybe break the
> > > > >> >>>>>  >>>> message into tables corresponding to header data, items, footer?
> > > >> >>>>  >>>
> > > > >> >>>>  >>> Drill does handle these kinds of data well, but currently the only
> > > > >> >>>>  >>> file formats that it can consume for this kind of data are JSON and
> > > > >> >>>>  >>> Parquet.
> > > >> >>>>  >>>
> > > > >> >>>>  >>> It would be great to have more.  I would love to work on this with
> > > > >> >>>>  >>> you.
> > > >> >>>  >>
> > > >> >
> > > >>
> > > >>
> > > >>
> > > >>
> > >
> >
>
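
The two options raised earlier in the thread (presenting a flattened version of a message, or breaking it into header, items, and footer tables) can be sketched roughly as follows. This is a minimal illustration in plain Python, not Drill code: the message layout, the field names (order_id, buyer, sku, qty, total_lines), and both function names are hypothetical, and real EDI segments and Drill's reader internals are not modeled.

```python
from typing import Any, Dict, List

# Hypothetical, heavily simplified EDI-like order, already parsed into nested form.
message: Dict[str, Any] = {
    "header": {"order_id": "PO-1001", "buyer": "ACME"},
    "items": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-7", "qty": 5},
    ],
    "footer": {"total_lines": 2},
}


def split_message(msg: Dict[str, Any]) -> Dict[str, List[Dict[str, Any]]]:
    """Break one nested message into header, items, and footer row sets.

    The order_id is copied into the item and footer rows so the three
    tables can be joined back together (an assumption of this sketch).
    """
    key = msg["header"]["order_id"]
    header_rows = [dict(msg["header"])]
    item_rows = [
        {"order_id": key, "line_no": line_no, **item}
        for line_no, item in enumerate(msg["items"], start=1)
    ]
    footer_rows = [{"order_id": key, **msg["footer"]}]
    return {"header": header_rows, "items": item_rows, "footer": footer_rows}


def flatten_message(msg: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Produce one row per item, repeating the header fields on every row."""
    return [{**msg["header"], **item} for item in msg["items"]]
```

Calling split_message(message) gives three small row sets that share order_id, while flatten_message(message) gives a single denormalized row set. The split form keeps header and footer data from being repeated; the flattened form is simpler to query but duplicates header fields on every item row.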
