From: Josh Wills <jwills@cloudera.com>
To: dev@crunch.apache.org
Date: Sun, 7 Apr 2013 22:18:11 -0700
Subject: Re: Crunch integration with ElasticSearch

Hey Christian,

Super-cool. Replies inlined.

On Sun, Apr 7, 2013 at 8:32 PM, Christian Tzolov wrote:

> I've been working on Crunch - ElasticSearch (http://www.elasticsearch.org/)
> integration over the weekend :)
>
> Here is my first prototype:
> https://github.com/tzolov/elasticsearch-hadoop#crunch
> and a sample application: http://bit.ly/Y7lasW.
>
> It implements an ES Source and Target on top of ES-Hadoop's
> (https://github.com/elasticsearch/elasticsearch-hadoop) ESInputFormat and
> ESOutputFormat.
>
> I'm not sure, though, what the best/right way is to build Sources/Targets
> for new Input/Output Formats. Any suggestions or references?
I built a Source for HCatalog last week as part of ML:
https://github.com/cloudera/ml/blob/master/hcatalog/src/main/java/com/cloudera/science/ml/hcatalog/HCatalogSource.java

The interesting bit is really in the configureSource method: if the inputId
is < 0, then it's a single-input MapReduce job, and you can essentially
configure the input just as you would for a regular MapReduce job. If the
inputId is >= 0, then it's a multi-input job (e.g., for a join), and you
have to use CrunchInputs with a FormatBundle object. The FormatBundle wraps
an InputFormat or an OutputFormat with any Configuration settings that the
InputFormat/OutputFormat needs. This way, you can have multiple inputs that
use the same InputFormat but different configuration settings (e.g., when
you're joining multiple Avro files together and each one needs its own
schema specified). (A rough sketch of this pattern is appended at the end of
this message.)

> The write to ES is tricky and at the moment looks more like a hack (see
> the doc).
>
> Cheers,
> Chris
>
> (P.S. The prototype doesn't support AvroTypeFamily yet, but I've been
> looking at a jackson-dataformat-avro kind of solution -- ES-Hadoop relies
> on Jackson for the JSON serialisation.)

I'd like to work on this as well -- I'll take a look tomorrow and try to put
together a pull request for anything that I think should be configured
differently.

J

--
Director of Data Science
Cloudera
Twitter: @josh_wills
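
A rough, illustrative sketch of the configureSource branching described
above, under some stated assumptions: the ESIllustrativeSource class name,
its constructor, the "es.query" setting key, and the dummy "/es" path are
made up for illustration (they are not Christian's prototype or the
es-hadoop API), and the Crunch classes used (org.apache.crunch.io.FormatBundle
and CrunchInputs) should be double-checked against the Crunch version you
build on. A real source would also implement the rest of
org.apache.crunch.Source.

import java.io.IOException;

import org.apache.crunch.io.CrunchInputs;
import org.apache.crunch.io.FormatBundle;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;

/**
 * Illustrative sketch only: shows how a Crunch Source's configureSource
 * method can branch on the inputId it is handed. Everything ES-specific
 * here is a placeholder.
 */
public class ESIllustrativeSource {

  private final Class<? extends InputFormat<?, ?>> inputFormatClass;
  private final String query; // placeholder for whatever settings ES needs

  public ESIllustrativeSource(Class<? extends InputFormat<?, ?>> inputFormatClass,
                              String query) {
    this.inputFormatClass = inputFormatClass;
    this.query = query;
  }

  public void configureSource(Job job, int inputId) throws IOException {
    if (inputId < 0) {
      // Single-input job: configure the Job directly, just as you would for
      // a plain MapReduce job that reads with this InputFormat.
      job.setInputFormatClass(inputFormatClass);
      job.getConfiguration().set("es.query", query); // placeholder key
    } else {
      // Multi-input job (e.g., a join): wrap the InputFormat and its settings
      // in a FormatBundle so this input's configuration stays separate from
      // other inputs that may use the same InputFormat class.
      FormatBundle<? extends InputFormat<?, ?>> bundle =
          FormatBundle.forInput(inputFormatClass);
      bundle.set("es.query", query); // placeholder key
      // ES has no real filesystem path; the dummy path is only there so the
      // (bundle, inputId) mapping can be recorded in the job configuration.
      CrunchInputs.addInputPath(job, new Path("/es"), bundle, inputId);
    }
  }
}

The bundle is what keeps each input's settings isolated, which is the point
of the Avro-join example above: each input carries its own schema setting
instead of fighting over a single key in the shared job Configuration.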