crunch-dev mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Crunch integration with ElasticSearch
Date Mon, 08 Apr 2013 13:51:43 GMT
On Mon, Apr 8, 2013 at 2:58 AM, Christian Tzolov <christian.tzolov@gmail.com> wrote:

> Hey Josh,
>
> Thanks for the tips!
>
> I followed HBaseSource.java when implementing the ESSource and copied
> its inputId handling approach:
>
> https://github.com/tzolov/elasticsearch-hadoop/blob/master/src/main/java/org/elasticsearch/hadoop/crunch/ESSource.java
>
> I don't completely understand the implications of the dummy Path parameter.
> In this context, is the Path needed only for the input equality check?
>
> The ESTarget is trickier. I was not sure what to do with the keyClass
> parameter in CrunchOutputs.addNamedOutput(), so I've set it to String.
> ES-Hadoop uses Jackson for JSON serialization, and it fails when trying
> to serialize internal Crunch Writable types, I guess because they are not
> public. Storing internal Crunch Writable types in ES doesn't make much
> sense anyway. The current implementation expects a custom (Writable) class
> to define the JSON format. Perhaps with Avro we can try to reuse the Avro
> schema.
>
> Here is the ES-Hadoop ticket for adding Crunch to the ES-Hadoop project:
> https://github.com/elasticsearch/elasticsearch-hadoop/issues/20
>
> Shall we deploy the 0.6.0-SNAPSHOT to some public snapshot repo?
> https://repository.apache.org/content/groups/snapshots/org/apache/crunch/
> is empty. Perhaps we can deploy the latest Jenkins builds into this
> snapshot repo, unless there is some policy against it?
>

I just think it means it's time to cut the 0.6.0 release. I would have
liked to get CRUNCH-165 in as well, but I don't think it's been tested
enough.


> Cheers,
> Chris
>
> On Mon, Apr 8, 2013 at 7:18 AM, Josh Wills <jwills@cloudera.com> wrote:
>
> > Hey Christian,
> >
> > Super cool. Replies inlined.
> >
> > On Sun, Apr 7, 2013 at 8:32 PM, Christian Tzolov <christian.tzolov@gmail.com> wrote:
> >
> > > I've been working on Crunch - ElasticSearch (http://www.elasticsearch.org/)
> > > integration over the weekend :)
> > >
> > > Here is my first prototype:
> > > https://github.com/tzolov/elasticsearch-hadoop#crunch and a sample
> > > application: http://bit.ly/Y7lasW.
> > >
> > > It implements ES Source and Target on top of the ES-Hadoop's
> > > (https://github.com/elasticsearch/elasticsearch-hadoop) ESInputFormat
> > > and ESOutputFormat.
> > >
> > > I'm not sure, though, what the best/right way is to build
> > > Sources/Targets for new Input/Output Formats. Any suggestions or
> > > references?
> > >
> >
> > I built a Source for HCatalog last week as part of ML:
> >
> > https://github.com/cloudera/ml/blob/master/hcatalog/src/main/java/com/cloudera/science/ml/hcatalog/HCatalogSource.java
> >
> > The interesting bit is really in the configureSource method: if the
> > inputId is < 0, then it's a single-input MapReduce job, and you can
> > essentially configure the input just as you would for a regular
> > MapReduce. If the inputId >= 0, then it's a multi-input job (e.g., for a
> > join), and you have to use CrunchInputs w/a FormatBundle object. The
> > FormatBundle wraps an InputFormat or an OutputFormat w/any Configuration
> > settings that the InputFormat/OutputFormat needs. This way, you can have
> > multiple inputs that use the same InputFormat, but have different
> > configuration settings (e.g., when you're joining multiple Avro files
> > together and they each need to have their own schema specified).
> >
> >
> >
> > > The write to ES is tricky and at the moment looks more like a hack
> > > (see the doc).
> > >
> > > Cheers
> > > Chris
> > >
> > > (P.S. The prototype doesn't support AvroTypeFamily yet, but I've been
> > > looking at a jackson-dataformat-avro kind of solution; ES-Hadoop
> > > relies on Jackson for the JSON serialisation.)
> > >
> >
> > I'd like to work on this as well. I'll take a look tomorrow and try to
> > put together a pull req for anything that I think should be configured
> > differently.
> >
> > J
> >
> >
> >
> > --
> > Director of Data Science
> > Cloudera <http://www.cloudera.com>
> > Twitter: @josh_wills <http://twitter.com/josh_wills>
> >
>
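
For reference, the inputId branching described in the quoted configureSource
paragraph can be sketched roughly as below. This is a stand-alone model only,
not Crunch code: InputIdSketch, JobModel, Bundle, and the format and setting
names are all hypothetical stand-ins, and no real Crunch or Hadoop class is
used. It only illustrates the idea that a negative inputId configures the job
directly, while a non-negative inputId stores a per-input bundle so that two
inputs can share a format but keep separate settings.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Stand-alone model of the inputId branching; every type here is a hypothetical stand-in. */
public class InputIdSketch {

    /** Stand-in for a format bundle: a format name plus the settings private to one input. */
    static final class Bundle {
        final String formatName;
        final Map<String, String> settings;

        Bundle(String formatName, Map<String, String> settings) {
            this.formatName = formatName;
            this.settings = new LinkedHashMap<>(settings);
        }
    }

    /** Stand-in for a job: job-wide settings plus per-input bundles keyed by inputId. */
    static final class JobModel {
        String inputFormat; // set only in the single-input case
        final Map<String, String> globalSettings = new LinkedHashMap<>();
        final Map<Integer, Bundle> bundlesByInputId = new LinkedHashMap<>();
    }

    static void configureSource(JobModel job, int inputId,
                                String formatName, Map<String, String> settings) {
        if (inputId < 0) {
            // Single-input job: configure the job itself, as for a plain MapReduce job.
            job.inputFormat = formatName;
            job.globalSettings.putAll(settings);
        } else {
            // Multi-input job: keep the format and its settings together per input,
            // so inputs sharing a format do not overwrite each other's settings.
            job.bundlesByInputId.put(inputId, new Bundle(formatName, settings));
        }
    }

    public static void main(String[] args) {
        JobModel single = new JobModel();
        configureSource(single, -1, "ExampleFormat",
                Collections.singletonMap("schema", "only.avsc"));
        System.out.println(single.inputFormat + " " + single.globalSettings);

        JobModel join = new JobModel();
        configureSource(join, 0, "ExampleFormat",
                Collections.singletonMap("schema", "left.avsc"));
        configureSource(join, 1, "ExampleFormat",
                Collections.singletonMap("schema", "right.avsc"));
        for (Map.Entry<Integer, Bundle> e : join.bundlesByInputId.entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue().formatName
                    + " " + e.getValue().settings);
        }
    }
}
```

In the join case both inputs use the same format name but each keeps its own
"schema" value, which is the situation the Avro example in the quoted text
describes.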



-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
