From: Sripad Sriram <sripad@path.com>
To: user@avro.apache.org
Reply-To: user@avro.apache.org
Date: Thu, 25 Apr 2013 08:26:58 -0700
Message-ID: <-6113640886534440207@unknownmsgid>
Subject: Re: Joining Avro input files in using Java mapreduce
Mime-Version: 1.0 (1.0)
Content-Type: multipart/alternative; boundary=bcaec54862fab272c104db310c16

--bcaec54862fab272c104db310c16
Content-Type: text/plain; charset=ISO-8859-1

Thanks! Martin, would you happen to have a gist of an example? Did you mean
the reducer input is NullWritable?

On Apr 25, 2013, at 7:44 AM, Martin Kleppmann <martin@rapportive.com> wrote:

Oh, sorry, you're right. I was too hasty.

One approach that I've used for joining Avro inputs is to use regular Hadoop
mappers and reducers (instead of AvroMapper/AvroReducer) with MultipleInputs
and AvroInputFormat. Your mapper input key type is then
AvroWrapper<GenericRecord>, and mapper input value type is NullWritable. This
approach uses Hadoop sequence files (rather than Avro files) between mappers
and reducers, so you have to take care of serializing mapper output and
unserializing reducer input yourself. It works, but you have to write quite a
bit of annoying boilerplate code.

I'd also be interested if anyone has a better solution. Perhaps we just need
to create the AvroMultipleInputs that I thought existed, but doesn't :)

Martin

On 24 April 2013 12:02, Sripad Sriram <sripad@path.com> wrote:

> Hey Martin,
>
> I think those classes refer to outputting to multiple files rather than
> reading from multiple files, which is what's needed for a reduce-side join.
>
> thanks,
> Sripad
>
>
> On Wed, Apr 24, 2013 at 3:35 AM, Martin Kleppmann <martin@rapportive.com> wrote:
>
>> Hey Sripad,
>>
>> Take a look at AvroMultipleInputs.
>>
>> http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapred/AvroMultipleOutputs.html (mapred version)
>>
>> http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapreduce/AvroMultipleOutputs.html (mapreduce version)
>>
>> Martin
>>
>>
>> On 23 April 2013 17:01, Sripad Sriram <sripad@path.com> wrote:
>>
>>> Hey folks,
>>>
>>> Aware that I can use Pig, Hive, etc to join avro files together, but I
>>> have several use cases where I need to perform a reduce-side join on two
>>> avro files. MultipleInputs doesn't seem to like AvroInputFormat - any
>>> thoughts?
>>>
>>> thanks!
>>> Sripad
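
For readers who want the shape of the reduce-side join discussed in this thread, here is a minimal plain-Java sketch. It is not Martin's code and it uses no Hadoop or Avro classes: it only simulates the idea that each mapper tags its records with the input they came from, the shuffle groups them by join key, and the reducer pairs left records with right records. Every name in it (ReduceSideJoinSketch, Tagged, emit, reduce, join) is hypothetical, and the String records are stand-ins for the serialized Avro records you would have to write and read yourself in a real job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of a reduce-side (inner) join; no Hadoop or Avro classes are used.
// All names here are hypothetical and exist only for illustration.
final class ReduceSideJoinSketch {

    /** A mapper output value: the record plus a tag naming the input it came from. */
    static final class Tagged {
        final String source;  // "L" for the left input, "R" for the right input
        final String record;  // stand-in for a serialized Avro record

        Tagged(String source, String record) {
            this.source = source;
            this.record = record;
        }
    }

    /** "Map" phase: emit (joinKey, tagged record) into the shuffle buffer. */
    static void emit(Map<String, List<Tagged>> shuffle, String key, Tagged value) {
        shuffle.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    /** "Reduce" phase for one key: split values by tag, then pair left with right. */
    static List<String> reduce(String key, List<Tagged> values) {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (Tagged t : values) {
            if (t.source.equals("L")) {
                left.add(t.record);
            } else {
                right.add(t.record);
            }
        }
        List<String> out = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                out.add(key + "\t" + l + "\t" + r);
            }
        }
        return out;
    }

    /** Runs both phases over two inputs given as {joinKey, record} pairs. */
    static List<String> join(String[][] leftInput, String[][] rightInput) {
        // TreeMap keeps keys sorted, loosely mimicking the sorted shuffle.
        Map<String, List<Tagged>> shuffle = new TreeMap<>();
        for (String[] kv : leftInput) {
            emit(shuffle, kv[0], new Tagged("L", kv[1]));
        }
        for (String[] kv : rightInput) {
            emit(shuffle, kv[0], new Tagged("R", kv[1]));
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Tagged>> e : shuffle.entrySet()) {
            out.addAll(reduce(e.getKey(), e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] users = { {"1", "alice"}, {"2", "bob"} };
        String[][] orders = { {"1", "book"}, {"1", "pen"}, {"3", "lamp"} };
        for (String line : join(users, orders)) {
            System.out.println(line);
        }
    }
}
```

In a real job the two emit loops would presumably become two mapper classes, one registered per input path, and the tag would have to travel inside whatever value type you serialize between mappers and reducers. Keys that appear in only one input produce no output here, so this is an inner join.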
--bcaec54862fab272c104db310c16--