Subject: Re: Multiple input formats and multiple output formats in Hadoop 0.20.2
From: Dino Kečo
To: mapreduce-user@hadoop.apache.org
Date: Wed, 10 Aug 2011 18:20:26 +0200

Hi John,

I think this is what you are looking for:

http://archive.cloudera.com/cdh/3/hadoop/api/org/apache/hadoop/mapreduce/lib/input/MultipleInputs.html
http://archive.cloudera.com/cdh/3/hadoop/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html

Examples of usage are part of the API docs.

Regards,
Dino Kečo

On Wed, Aug 10, 2011 at 6:08 PM, Jian Fang <jian.fang.subscribe@gmail.com> wrote:

> Hi,
>
> I am working on a project which requires multiple input formats and
> multiple output formats. Basically, I store some sales rank data in a
> Cassandra cluster and receive a sales rank update file each day to update
> the ranks in Cassandra. Meanwhile, I need to find all the products whose
> rank change exceeds a threshold and output them to a file. That is to say,
> I need two input formats, one from the file system (the sales rank update
> file) and one from Cassandra (the current sales ranks), and two output
> formats, one to the file system (products whose rank change exceeds the
> threshold) and one to Cassandra (writing the new ranks to Cassandra).
>
> Right now I use multiple cascading jobs to do the work and use HDFS to
> share data among jobs. But this is not very efficient, since some
> intermediate files need to be read multiple times by different jobs. I
> wonder if there is a more elegant way to solve this problem. Hadoop 0.19
> seems to support multiple input/output formats. It would be great if I
> could merge the multiple jobs into one with multiple input formats and
> multiple output formats. Is this doable in Hadoop 0.20.2? Are there any
> examples of multiple input formats and multiple output formats for
> Hadoop 0.20.2?
>
> Thanks in advance,
>
> John
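A minimal sketch of how the two classes from the links above might be wired together in the new (`org.apache.hadoop.mapreduce`) API, as shipped in the CDH3 distribution the links point to. `RankUpdateMapper`, `CassandraRankInputFormat`, `CassandraRankOutputFormat`, and the elided rank-comparison logic are hypothetical placeholders for John's own classes, not part of the Hadoop API:

```java
// Sketch only: one job with two inputs (file + Cassandra) and two
// outputs (Cassandra via the job's OutputFormat, threshold file via
// a MultipleOutputs named output). Placeholder classes are assumed.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class SalesRankJob {

  public static class RankReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context ctx) {
      mos = new MultipleOutputs<Text, Text>(ctx);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      // Hypothetical: compare the update-file rank against the
      // Cassandra rank and compute the new rank and its change.
      Text newRank = new Text();
      boolean changeExceedsThreshold = false; // elided comparison logic

      ctx.write(key, newRank);             // goes to the Cassandra output
      if (changeExceedsThreshold) {
        mos.write("exceeded", key, newRank); // goes to the text file
      }
    }

    @Override
    protected void cleanup(Context ctx)
        throws IOException, InterruptedException {
      mos.close(); // flush the named output
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "sales-rank-update");
    job.setJarByClass(SalesRankJob.class);

    // Two inputs, each paired with its own InputFormat and Mapper.
    MultipleInputs.addInputPath(job, new Path(args[0]),
        TextInputFormat.class, RankUpdateMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]),
        CassandraRankInputFormat.class, CassandraRankMapper.class);

    // Primary output to Cassandra, plus a named output for the
    // products whose rank change exceeds the threshold.
    job.setOutputFormatClass(CassandraRankOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    MultipleOutputs.addNamedOutput(job, "exceeded",
        TextOutputFormat.class, Text.class, Text.class);

    job.setReducerClass(RankReducer.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note this assumes the `mapreduce.lib` versions of both classes are available on the classpath (they are in CDH3; plain Apache 0.20.2 may only ship the older `mapred.lib` variants), so it is worth checking which jar you actually have before merging the jobs.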