crunch-dev mailing list archives

From Christian Tzolov <christian.tzo...@gmail.com>
Subject Re: Crunch integration with ElasticSearch
Date Mon, 08 Apr 2013 09:58:11 GMT
Hey Josh,

Thanks for the tips!

I followed HBaseSource.java when implementing the ESSource and copied the
inputId handling approach:
https://github.com/tzolov/elasticsearch-hadoop/blob/master/src/main/java/org/elasticsearch/hadoop/crunch/ESSource.java

I don't completely understand the implications of the dummy Path parameter.
In this context, is the Path needed only for the input equality check?
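
For reference, the pattern I copied looks roughly like the sketch below
(EsInputFormat, the "es.query" key and the esQuery field are stand-ins, and the
exact Crunch signatures may differ slightly; the real code is in the ESSource
link above):

import java.io.IOException;

import org.apache.crunch.io.CrunchInputs;
import org.apache.crunch.io.FormatBundle;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

// Sketch of the configureSource pattern borrowed from HBaseSource.
public void configureSource(Job job, int inputId) throws IOException {
  if (inputId < 0) {
    // Single-input job: configure the InputFormat directly on the Job.
    job.setInputFormatClass(EsInputFormat.class);
    job.getConfiguration().set("es.query", esQuery);
  } else {
    // Multi-input job: wrap the InputFormat and its settings in a FormatBundle
    // and register it under a dummy Path, which seems to serve only to
    // identify this particular input.
    FormatBundle<EsInputFormat> bundle = FormatBundle.forInput(EsInputFormat.class)
        .set("es.query", esQuery);
    CrunchInputs.addInputPath(job, new Path("es://" + esQuery), bundle, inputId);
  }
}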

The ESTarget is trickier. I wasn't sure what to do with the keyClass
parameter in CrunchOutputs.addNamedOutput(), so I've set it to String.
ES-Hadoop uses Jackson for JSON serialization, and it fails when trying to
serialize internal Crunch Writable types, I guess because they are not
public. Storing internal Crunch Writable types in ES doesn't make much sense
anyway. The current implementation expects a custom (Writable) class to
define the JSON format. Perhaps with Avro we could reuse the Avro schema.
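
On the output side, the relevant bit is roughly the sketch below (again only a
sketch: EsOutputFormat stands in for the ES-Hadoop output format, MapWritable
for whatever Writable defines the JSON document, and configureOutput is just a
helper called from the target's configure method):

import org.apache.crunch.io.CrunchOutputs;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.mapreduce.Job;

// Sketch of how the ES output gets wired into the Job.
private void configureOutput(Job job, String name) {
  if (name == null) {
    // Single-output job: set the OutputFormat directly.
    job.setOutputFormatClass(EsOutputFormat.class);
  } else {
    // Multi-output job: register a named output. keyClass is set to String for
    // now, since the key appears to be ignored on the ES side; this is exactly
    // the part I'm unsure about.
    CrunchOutputs.addNamedOutput(job, name, EsOutputFormat.class,
        String.class, MapWritable.class);
  }
}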

Here is the ES-Hadoop ticket for adding Crunch to the ES-Hadoop project:
https://github.com/elasticsearch/elasticsearch-hadoop/issues/20

Shall we deploy the 0.6.0-SNAPSHOT to some public snapshot repo?
https://repository.apache.org/content/groups/snapshots/org/apache/crunch/ is
currently empty. Perhaps we could deploy the latest Jenkins builds into this
snapshot repo, unless there is some policy against it?

Cheers,
Chris


On Mon, Apr 8, 2013 at 7:18 AM, Josh Wills <jwills@cloudera.com> wrote:

> Hey Christian,
>
> Super-cool. Replies inlined.
>
> On Sun, Apr 7, 2013 at 8:32 PM, Christian Tzolov <christian.tzolov@gmail.com> wrote:
>
> > I've been working on Crunch - ElasticSearch (http://www.elasticsearch.org/)
> > integration over the weekend :)
> >
> > Here is my first prototype:
> > https://github.com/tzolov/elasticsearch-hadoop#crunch and a sample
> > application: http://bit.ly/Y7lasW.
> >
> > It implements ES Source and Target on top of the ES-Hadoop's (
> > https://github.com/elasticsearch/elasticsearch-hadoop) ESInputFormat and
> > ESOutputFormat.
> >
> > Not sure though what is the best/right way to build Source/Targets for new
> > Input/Output Formats? Any suggestions, references?
> >
>
> I built a Source for HCatalog last week as part of ML:
>
>
> https://github.com/cloudera/ml/blob/master/hcatalog/src/main/java/com/cloudera/science/ml/hcatalog/HCatalogSource.java
>
> The interesting bit is really in the configureSource method: if the inputId
> is < 0, then it's a single-input MapReduce job, and you can essentially
> configure the input just as you would for a regular MapReduce. If the
> inputId >= 0, then it's a multi-input job (e.g., for a join), and you have
> to use CrunchInputs w/a FormatBundle object. The FormatBundle wraps an
> InputFormat or an OutputFormat w/any Configuration settings that the
> InputFormat/OutputFormat needs. This way, you can have multiple inputs that
> use the same InputFormat, but have different configuration settings (e.g.,
> when you're joining multiple Avro files together and they each need to have
> their own schema specified.)
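
If I read this right, two inputs that share the same InputFormat but need
different settings would then look roughly like the following (just a sketch to
check my understanding; AvroKeyInputFormat, the "avro.schema.input.key" setting
and the job/schema variables are only illustrative):

// Same InputFormat for both inputs, but each FormatBundle carries its own
// Configuration settings (e.g. a different Avro schema per input).
FormatBundle<AvroKeyInputFormat> users = FormatBundle.forInput(AvroKeyInputFormat.class)
    .set("avro.schema.input.key", userSchema.toString());
FormatBundle<AvroKeyInputFormat> orders = FormatBundle.forInput(AvroKeyInputFormat.class)
    .set("avro.schema.input.key", orderSchema.toString());

// Distinct inputIds keep the two inputs apart in the multi-input job.
CrunchInputs.addInputPath(job, new Path("/data/users"), users, 0);
CrunchInputs.addInputPath(job, new Path("/data/orders"), orders, 1);
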
>
>
>
> > The write to ES is tricky and at the moment looks more like a hack (see the
> > doc).
> >
> > Cheers
> > Chris
> >
> > (P.S. The prototype doesn't support AvroTypeFamily yet, but I've been looking
> > at a jackson-dataformat-avro kind of solution; ES-Hadoop relies on Jackson
> > for the JSON serialisation.)
> >
>
> I'd like to work on this as well-- I'll take a look tomorrow and try to put
> together a pull req for anything that I think should be configured
> differently.
>
> J
>
>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
