From: Josh Wills <jwills@cloudera.com>
To: dev@crunch.apache.org
Date: Sun, 7 Apr 2013 22:18:11 -0700
Subject: Re: Crunch integration with ElasticSearch

Hey Christian,

Super-cool. Replies inlined.

On Sun, Apr 7, 2013 at 8:32 PM, Christian Tzolov wrote:

> I've been working on Crunch - ElasticSearch (http://www.elasticsearch.org/)
> integration over the weekend :)
>
> Here is my first prototype:
> https://github.com/tzolov/elasticsearch-hadoop#crunch
> and a sample application: http://bit.ly/Y7lasW.
>
> It implements an ES Source and Target on top of ES-Hadoop's
> (https://github.com/elasticsearch/elasticsearch-hadoop) ESInputFormat and
> ESOutputFormat.
>
> I'm not sure, though, what the best/right way is to build Sources/Targets
> for new Input/Output Formats. Any suggestions or references?
I built a Source for HCatalog last week as part of ML:
https://github.com/cloudera/ml/blob/master/hcatalog/src/main/java/com/cloudera/science/ml/hcatalog/HCatalogSource.java

The interesting bit is really in the configureSource method: if the inputId
is < 0, then it's a single-input MapReduce job, and you can essentially
configure the input just as you would for a regular MapReduce job. If the
inputId is >= 0, then it's a multi-input job (e.g., for a join), and you
have to use CrunchInputs with a FormatBundle object. The FormatBundle wraps
an InputFormat or an OutputFormat with any Configuration settings that the
InputFormat/OutputFormat needs. This way, you can have multiple inputs that
use the same InputFormat but different configuration settings (e.g., when
you're joining multiple Avro files together and each one needs its own
schema specified). (A rough sketch of this pattern is appended at the end of
this message.)

> The write to ES is tricky and at the moment looks more like a hack (see
> the doc).
>
> Cheers,
> Chris
>
> (P.S. The prototype doesn't support AvroTypeFamily yet, but I've been
> looking at a jackson-dataformat-avro kind of solution -- ES-Hadoop relies
> on Jackson for the JSON serialisation.)

I'd like to work on this as well -- I'll take a look tomorrow and try to put
together a pull request for anything that I think should be configured
differently.

J

--
Director of Data Science
Cloudera
Twitter: @josh_wills
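
A rough, illustrative sketch of the configureSource branching described
above, under some stated assumptions: the ESIllustrativeSource class name,
its constructor, the "es.query" setting key, and the dummy "/es" path are
made up for illustration (they are not Christian's prototype or the
es-hadoop API), and the Crunch classes used (org.apache.crunch.io.FormatBundle
and CrunchInputs) should be double-checked against the Crunch version you
build on. A real source would also implement the rest of
org.apache.crunch.Source.

import java.io.IOException;

import org.apache.crunch.io.CrunchInputs;
import org.apache.crunch.io.FormatBundle;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;

/**
 * Illustrative sketch only: shows how a Crunch Source's configureSource
 * method can branch on the inputId it is handed. Everything ES-specific
 * here is a placeholder.
 */
public class ESIllustrativeSource {

  private final Class<? extends InputFormat<?, ?>> inputFormatClass;
  private final String query; // placeholder for whatever settings ES needs

  public ESIllustrativeSource(Class<? extends InputFormat<?, ?>> inputFormatClass,
                              String query) {
    this.inputFormatClass = inputFormatClass;
    this.query = query;
  }

  public void configureSource(Job job, int inputId) throws IOException {
    if (inputId < 0) {
      // Single-input job: configure the Job directly, just as you would for
      // a plain MapReduce job that reads with this InputFormat.
      job.setInputFormatClass(inputFormatClass);
      job.getConfiguration().set("es.query", query); // placeholder key
    } else {
      // Multi-input job (e.g., a join): wrap the InputFormat and its settings
      // in a FormatBundle so this input's configuration stays separate from
      // other inputs that may use the same InputFormat class.
      FormatBundle<? extends InputFormat<?, ?>> bundle =
          FormatBundle.forInput(inputFormatClass);
      bundle.set("es.query", query); // placeholder key
      // ES has no real filesystem path; the dummy path is only there so the
      // (bundle, inputId) mapping can be recorded in the job configuration.
      CrunchInputs.addInputPath(job, new Path("/es"), bundle, inputId);
    }
  }
}

The bundle is what keeps each input's settings isolated, which is the point
of the Avro-join example above: each input carries its own schema setting
instead of fighting over a single key in the shared job Configuration.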