Subject: Re: Multiple input formats and multiple output formats in Hadoop 0.20.2
From: Dino Kečo
To: mapreduce-user@hadoop.apache.org
Date: Wed, 10 Aug 2011 18:20:26 +0200

Hi John,

I think this is what you are looking for:

http://archive.cloudera.com/cdh/3/hadoop/api/org/apache/hadoop/mapreduce/lib/input/MultipleInputs.html
http://archive.cloudera.com/cdh/3/hadoop/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html

Examples of usage are part of the API docs.

Regards,
Dino Kečo

On Wed, Aug 10, 2011 at 6:08 PM, Jian Fang <jian.fang.subscribe@gmail.com> wrote:

> Hi,
>
> I am working on a project which requires multiple input formats and
> multiple output formats. Basically, I store some sales rank data in a
> Cassandra cluster and receive a sales rank update file each day to update
> the ranks in Cassandra. Meanwhile, I need to find all the products whose
> rank change exceeds a threshold and output them to a file. That is to say,
> I need two input formats, one from the file system (the sales rank update
> file) and one from Cassandra (the current sales ranks), and two output
> formats, one to the file system (products whose rank change exceeds the
> threshold) and one to Cassandra (writing the new ranks to Cassandra).
>
> Right now I use multiple cascading jobs to do the work and use HDFS to
> share data among jobs. But this is not very efficient, since some
> intermediate files need to be read multiple times by different jobs. I
> wonder if there is a more elegant way to solve this problem. Hadoop 0.19
> seems to support multiple input/output formats. It would be great if I
> could merge the multiple jobs into one with multiple input formats and
> multiple output formats. Is this doable in Hadoop 0.20.2? Are there any
> examples of multiple input formats and multiple output formats for
> Hadoop 0.20.2?
>
> Thanks in advance,
>
> John
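A minimal sketch of how the two classes from the links above might be wired together in the new (`org.apache.hadoop.mapreduce`) API, as shipped in the CDH3 distribution the links point to. `RankUpdateMapper`, `CassandraRankInputFormat`, `CassandraRankOutputFormat`, and the elided rank-comparison logic are hypothetical placeholders for John's own classes, not part of the Hadoop API:

```java
// Sketch only: one job with two inputs (file + Cassandra) and two
// outputs (Cassandra via the job's OutputFormat, threshold file via
// a MultipleOutputs named output). Placeholder classes are assumed.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class SalesRankJob {

  public static class RankReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> mos;

    @Override
    protected void setup(Context ctx) {
      mos = new MultipleOutputs<Text, Text>(ctx);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      // Hypothetical: compare the update-file rank against the
      // Cassandra rank and compute the new rank and its change.
      Text newRank = new Text();
      boolean changeExceedsThreshold = false; // elided comparison logic

      ctx.write(key, newRank);             // goes to the Cassandra output
      if (changeExceedsThreshold) {
        mos.write("exceeded", key, newRank); // goes to the text file
      }
    }

    @Override
    protected void cleanup(Context ctx)
        throws IOException, InterruptedException {
      mos.close(); // flush the named output
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "sales-rank-update");
    job.setJarByClass(SalesRankJob.class);

    // Two inputs, each paired with its own InputFormat and Mapper.
    MultipleInputs.addInputPath(job, new Path(args[0]),
        TextInputFormat.class, RankUpdateMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]),
        CassandraRankInputFormat.class, CassandraRankMapper.class);

    // Primary output to Cassandra, plus a named output for the
    // products whose rank change exceeds the threshold.
    job.setOutputFormatClass(CassandraRankOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    MultipleOutputs.addNamedOutput(job, "exceeded",
        TextOutputFormat.class, Text.class, Text.class);

    job.setReducerClass(RankReducer.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note this assumes the `mapreduce.lib` versions of both classes are available on the classpath (they are in CDH3; plain Apache 0.20.2 may only ship the older `mapred.lib` variants), so it is worth checking which jar you actually have before merging the jobs.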