Subject: Re: chaining (the output of) jobs/ reducers
From: Adrian CAPDEFIER
To: user@hadoop.apache.org
Date: Tue, 17 Sep 2013 14:23:40 +0100
In-Reply-To: <8ED5ED25-2DE0-43FC-A1F8-F37F79266AEA@apache.org>

I've just seen your email, Vinod. This is the behaviour that I'd expect, and it is similar to other data integration tools; I will keep an eye out for it as a long-term option.

On Fri, Sep 13, 2013 at 5:26 AM, Vinod Kumar Vavilapalli wrote:

> Other than the short-term solutions that others have proposed, Apache Tez
> solves this exact problem. It can run M-M-R-R-R chains, multi-way mappers
> and reducers, and your own custom processors - all without persisting the
> intermediate outputs to HDFS.
>
> It works on top of YARN, though the first release of Tez is yet to happen.
>
> You can learn more about it here: http://tez.incubator.apache.org/
>
> HTH,
> +Vinod
>
> On Sep 12, 2013, at 6:36 AM, Adrian CAPDEFIER wrote:
>
> Howdy,
>
> My application requires 2 distinct processing steps (reducers) to be
> performed on the input data. The first operation changes the key values,
> and records that had different keys in step 1 can end up having the same
> key in step 2.
>
> The heavy lifting of the operation is in step 1; step 2 only combines
> records whose keys were changed.
>
> In short, the overview is:
> Sequential file -> Step 1 -> Step 2 -> Output.
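The two-step flow above can be simulated in plain Java. This is a single-process sketch, not MapReduce code; the lower-casing `rekey` transform is a hypothetical stand-in for whatever key change step 1 really performs:

```java
import java.util.*;

public class TwoStepChain {
    // Step 1's key transform: a hypothetical stand-in that lower-cases key1,
    // so records with different step-1 keys can collide on the same step-2 key.
    static String rekey(String key1) {
        return key1.toLowerCase();
    }

    // Step 2: regroup step 1's output by the new key and combine the values
    // (here, by summing them) for records whose keys now collide.
    static Map<String, Integer> step2(Map<String, Integer> step1Output) {
        Map<String, Integer> combined = new TreeMap<>();
        for (Map.Entry<String, Integer> e : step1Output.entrySet()) {
            combined.merge(rekey(e.getKey()), e.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // (key1, value1) records, as if emitted by step 1 from a sequential file
        Map<String, Integer> step1Output = new HashMap<>();
        step1Output.put("Alpha", 1);
        step1Output.put("ALPHA", 2);  // distinct key1, same key2 as "Alpha"
        step1Output.put("Beta", 3);

        System.out.println(step2(step1Output));  // {alpha=3, beta=3}
    }
}
```

Because `rekey` maps the distinct step-1 keys `"Alpha"` and `"ALPHA"` to the same step-2 key, step 2 combines their values — exactly the key-collision behaviour described above.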
>
> To implement this in Hadoop, it seems that I need to create a separate job
> for each step.
>
> Now I assumed there would be some sort of job management under Hadoop to
> link Jobs 1 and 2, but the only thing I could find was related to job
> scheduling, and nothing on how to synchronize the input/output of the
> linked jobs.
>
> The only crude solution that I can think of is to use a temporary file
> under HDFS, but even so I'm not sure if this will work.
>
> The overview of the process would be:
> Sequential Input (lines) => Job A[Mapper (key1, value1) => ChainReducer
> (key2, value2)] => Temporary file => Job B[Mapper (key2, value2) => Reducer
> (key2, value3)] => output.
>
> Is there a better way to pass the output from Job A as input to Job B
> (e.g. using network streams or some built-in Java classes that don't do
> disk I/O)?
>
> The temporary file solution will work in a single-node configuration, but
> I'm not sure about an MPP config.
>
> Let's say Job A runs on nodes 0 and 1 and Job B runs on nodes 2 and 3, or
> both jobs run on all 4 nodes - will HDFS be able to redistribute the
> records between nodes automagically, or does this need to be coded
> somehow?
>
> CONFIDENTIALITY NOTICE
> NOTICE: This message is intended for the use of the individual or entity
> to which it is addressed and may contain information that is confidential,
> privileged and exempt from disclosure under applicable law. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any printing, copying, dissemination, distribution, disclosure or
> forwarding of this communication is strictly prohibited. If you have
> received this communication in error, please contact the sender immediately
> and delete it from your system. Thank You.
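The temporary-file handoff between Job A and Job B described in the quoted question can be sketched as follows. This is a single-process simulation using a local temp file in place of an HDFS path — the `jobA`/`jobB` names are illustrative stand-ins; on a real cluster, Job A's driver would block on `waitForCompletion(true)` before Job B's input path is pointed at Job A's output directory:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class ChainViaTempFile {
    // "Job A": writes (key2, value2) pairs to a temporary file, one
    // tab-separated pair per line, standing in for Job A's output directory.
    static Path jobA(List<String> keyValueLines) throws IOException {
        Path tmp = Files.createTempFile("jobA-out", ".txt");
        Files.write(tmp, keyValueLines);
        return tmp;
    }

    // "Job B": reads Job A's output back, groups by key, and sums the values,
    // standing in for Job B's mapper/reducer pass over the intermediate file.
    static Map<String, Integer> jobB(Path jobAOutput) throws IOException {
        Map<String, Integer> result = new TreeMap<>();
        for (String line : Files.readAllLines(jobAOutput)) {
            String[] kv = line.split("\t");
            result.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = jobA(List.of("a\t1", "a\t2", "b\t5"));
        try {
            System.out.println(jobB(tmp));  // {a=3, b=5}
        } finally {
            Files.delete(tmp);              // clean up the intermediate file
        }
    }
}
```

The pattern is the one the question arrives at: the intermediate result is persisted once, then re-read and regrouped by the second job; the grouping-by-key in `jobB` is what HDFS plus the shuffle phase would do across nodes in a real two-job chain.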