From: Ted Dunning
Date: Mon, 4 Mar 2013 14:43:06 -0500
Subject: Re: Accumulo and Mapreduce
To: user@hadoop.apache.org

Chaining the jobs is a fantastically inefficient solution. If you use Pig
or Cascading, the optimizer will glue all of your map functions into a
single mapper. The result is something like:

    (mapper1 -> mapper2 -> mapper3) => reducer

Here the parentheses indicate that all of the map functions are executed as
a single function formed by composing mapper1, mapper2, and mapper3.
Writing multiple jobs to do this forces *lots* of unnecessary traffic to
your persistent store and lots of unnecessary synchronization.

You can do this optimization by hand, but using a higher level language is
often better for maintenance.
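For the by-hand route, Hadoop ships ChainMapper/ChainReducer, which do
exactly this composition inside one job. Below is a minimal sketch, assuming
the Hadoop 2 API in org.apache.hadoop.mapreduce.lib.chain; on Hadoop 1 the
equivalent lives in org.apache.hadoop.mapred.lib and takes a JobConf. The
three mappers, the reducer, and the Text-based key/value types here are
hypothetical stand-ins for the Mapper1/Mapper2/Mapper3/Reducer1 discussed
in the quoted thread below.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
    import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class ComposedJob {

      // Hypothetical first stage: keys each line by its byte offset.
      public static class Mapper1 extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          ctx.write(new Text(key.toString()), value);
        }
      }

      // Hypothetical middle stages; identity transforms for the sketch.
      public static class Mapper2 extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text key, Text value, Context ctx)
            throws IOException, InterruptedException {
          ctx.write(key, value);
        }
      }

      public static class Mapper3 extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text key, Text value, Context ctx)
            throws IOException, InterruptedException {
          ctx.write(key, value);
        }
      }

      // Hypothetical reducer: passes grouped values straight through.
      public static class Reducer1 extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
          for (Text v : values) {
            ctx.write(key, v);
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "(m1 -> m2 -> m3) => reducer");
        job.setJarByClass(ComposedJob.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        TextInputFormat.addInputPath(job, new Path(args[0]));
        TextOutputFormat.setOutputPath(job, new Path(args[1]));

        // All three map functions run back to back inside a single map
        // task, so nothing is written to the persistent store between them.
        ChainMapper.addMapper(job, Mapper1.class, LongWritable.class, Text.class,
            Text.class, Text.class, new Configuration(false));
        ChainMapper.addMapper(job, Mapper2.class, Text.class, Text.class,
            Text.class, Text.class, new Configuration(false));
        ChainMapper.addMapper(job, Mapper3.class, Text.class, Text.class,
            Text.class, Text.class, new Configuration(false));

        // One reducer consumes the composed map output.
        ChainReducer.setReducer(job, Reducer1.class, Text.class, Text.class,
            Text.class, Text.class, new Configuration(false));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Pig and Cascading perform the same composition automatically during query
planning, which is the maintenance argument above.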
On Mon, Mar 4, 2013 at 1:52 PM, Russell Jurney <russell.jurney@gmail.com> wrote:

> You can chain MR jobs with Oozie, but I would suggest using Cascading,
> Pig, or Hive. You can do this in a couple lines of code, I suspect. Two
> MapReduce jobs should not pose any kind of challenge with the right
> tools.
>
>
> On Monday, March 4, 2013, Sandy Ryza wrote:
>
>> Hi Aji,
>>
>> Oozie is a mature project for managing MapReduce workflows.
>> http://oozie.apache.org/
>>
>> -Sandy
>>
>>
>> On Mon, Mar 4, 2013 at 8:17 AM, Justin Woody <justin.woody@gmail.com> wrote:
>>
>>> Aji,
>>>
>>> Why don't you just chain the jobs together?
>>> http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining
>>>
>>> Justin
>>>
>>> On Mon, Mar 4, 2013 at 11:11 AM, Aji Janis <aji1705@gmail.com> wrote:
>>> > Russell, thanks for the link.
>>> >
>>> > I am interested in finding a solution (if one is out there) where
>>> > Mapper1 outputs a custom object and Mapper2 can use that as input.
>>> > One way to do this, obviously, is by writing to Accumulo in my case.
>>> > But is there another solution for this:
>>> >
>>> > List<MyObject> ----> Input to Job
>>> >
>>> > MyObject ---> Input to Mapper1 (process MyObject) ----> Output
>>> > <MyObjectId, MyObject>
>>> >
>>> > <MyObjectId, MyObject> are Input to Mapper2 ... and so on
>>> >
>>> > Ideas?
>>> >
>>> > On Mon, Mar 4, 2013 at 10:00 AM, Russell Jurney
>>> > <russell.jurney@gmail.com> wrote:
>>> >>
>>> >> http://svn.apache.org/repos/asf/accumulo/contrib/pig/trunk/src/main/java/org/apache/accumulo/pig/AccumuloStorage.java
>>> >>
>>> >> AccumuloStorage for Pig comes with Accumulo. The easiest way would be
>>> >> to try it.
>>> >>
>>> >> Russell Jurney http://datasyndrome.com
>>> >>
>>> >> On Mar 4, 2013, at 5:30 AM, Aji Janis <aji1705@gmail.com> wrote:
>>> >>
>>> >> Hello,
>>> >>
>>> >> I have an MR job design with a flow like this: Mapper1 -> Mapper2 ->
>>> >> Mapper3 -> Reducer1. Mapper1's input is an Accumulo table. M1's
>>> >> output goes to M2, and so on. Finally the Reducer writes output to
>>> >> Accumulo.
>>> >>
>>> >> Questions:
>>> >>
>>> >> 1) Has anyone tried something like this before? Are there any
>>> >> workflow control APIs (in or outside of Hadoop) that can help me set
>>> >> up the job like this, or am I limited to using Quartz for this?
>>> >> 2) If both M2 and M3 needed to write some data to the same two tables
>>> >> in Accumulo, is it possible to do so? Are there any good Accumulo
>>> >> MapReduce jobs you can point me to? Blogs/pages that I can use for
>>> >> reference (starting point/best practices)?
>>> >>
>>> >> Thank you in advance for any suggestions!
>>> >>
>>> >> Aji
>
> --
> Russell Jurney  twitter.com/rjurney  russell.jurney@gmail.com  datasyndrome.com
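A note on the custom-object question in the quoted thread: with chained
mappers (or any single job), the <MyObjectId, MyObject> pair passed from
Mapper1 to Mapper2 only has to be Hadoop-serializable; it never needs to be
persisted to Accumulo. Here is a minimal sketch of such a class implementing
Hadoop's Writable interface. The id and payload fields are hypothetical,
since the thread does not say what MyObject contains.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    // Hypothetical MyObject from the thread; the fields are assumptions.
    public class MyObject implements Writable {
      private final LongWritable id = new LongWritable();
      private final Text payload = new Text();

      public void set(long newId, String newPayload) {
        id.set(newId);
        payload.set(newPayload);
      }

      public long getId() { return id.get(); }
      public String getPayload() { return payload.toString(); }

      @Override
      public void write(DataOutput out) throws IOException {
        // Called when the object is handed between chained mappers or
        // shuffled to the reducer; no trip through the persistent store.
        id.write(out);
        payload.write(out);
      }

      @Override
      public void readFields(DataInput in) throws IOException {
        id.readFields(in);
        payload.readFields(in);
      }
    }

Mapper1 would then extend Mapper<K, V, LongWritable, MyObject> (the
<MyObjectId, MyObject> pair from the thread, with LongWritable standing in
for MyObjectId) and be registered via ChainMapper.addMapper(...,
LongWritable.class, MyObject.class, ...), so Mapper2 receives the pair
directly in memory.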