hadoop-mapreduce-user mailing list archives

From Panagiotis Antonopoulos <antonopoulos...@hotmail.com>
Subject RE: Problem when using MultipleOutputs with many files
Date Mon, 05 Sep 2011 14:41:17 GMT

Thanks both of you!

Harsh must be right.
The source of the HBase version that I use already seems to contain the changes described at https://issues.apache.org/jira/browse/MAPREDUCE-1853
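
For reference, a minimal sketch of the per-region routing described in the quoted message, i.e. choosing one named output per target region from the row key. It is not code from this thread: the class name RegionOutputRouter, the "regionN" output labels, and the use of Java Strings in place of HBase byte[] row keys are all assumptions made for illustration. In a real job the returned label would be the name passed to MultipleOutputs when writing a record.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: maps a row key to the named output of the region
// whose key range contains it. Strings stand in for HBase byte[] keys.
class RegionOutputRouter {
    // Region start key -> output label ("region0", "region1", ...).
    private final TreeMap<String, String> outputByStartKey = new TreeMap<>();

    RegionOutputRouter(String... regionStartKeys) {
        // Sort the start keys so labels follow region order, not argument order.
        int index = 0;
        for (String startKey : new TreeSet<>(Arrays.asList(regionStartKeys))) {
            outputByStartKey.put(startKey, "region" + index++);
        }
    }

    // Returns the label of the region with the greatest start key <= rowKey.
    String outputFor(String rowKey) {
        Map.Entry<String, String> entry = outputByStartKey.floorEntry(rowKey);
        if (entry == null) {
            throw new IllegalArgumentException("No region covers row key: " + rowKey);
        }
        return entry.getValue();
    }
}
```

The first region of an HBase table has an empty start key, so passing "" as one of the start keys makes every row key resolve to some region. With 280 regions this yields 280 distinct labels, which matches the number of outputs mentioned in the thread.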

> From: harsh@cloudera.com
> Date: Fri, 2 Sep 2011 23:03:09 +0530
> Subject: Re: Problem when using MultipleOutputs with many files
> To: mapreduce-user@hadoop.apache.org
> 
> Hello David,
> 
> MAPREDUCE-1853 was already back-ported into CDH3u0 [1], so it shouldn't
> be the cause of Panagiotis's performance problem.
> 
> P.S. Please do not upgrade to the 0.21.x series in production, as it is
> not deemed stable yet. This is noted on the Apache Hadoop website as
> well.
> 
> [1] - http://archive.cloudera.com/cdh/3/hadoop-0.20.2+923.21.releasenotes.html
> 
> On Fri, Sep 2, 2011 at 8:39 PM, David Rosenstrauch <darose@darose.net> wrote:
> > On 09/02/2011 09:14 AM, Panagiotis Antonopoulos wrote:
> >>
> >> Hello guys,
> >>
> >> I am using hadoop-0.20.2-cdh3u0 and I use MultipleOutputs to divide the
> >> HFiles (which are the output of my MR job) so that each file can fit into
> >> one region of the table where I am going to bulk load them.
> >>
> >> Therefore I have one MultipleOutputs output per region, and as a result I
> >> have 280 different outputs.
> >> I just realized that using so many outputs makes my job a lot slower than
> >> it is when I have just one output.
> >>
> >> Do you know what goes wrong? Has anyone noticed the same?
> >>
> >> Thank you!
> >> Panagiotis
> >
> >
> > You're probably running into this bug, which crushes the performance of
> > MultipleOutputs:
> >
> > https://issues.apache.org/jira/browse/MAPREDUCE-1853
> >
> > Apparently it's fixed in v0.21, so try to upgrade if you can.
> >
> > I wasn't able to upgrade our code, however (we were also using Cloudera CDH,
> > which as you see is 0.20).  What I eventually wound up doing to work around it
> > was to use our own local copy of the MultipleOutputs class (I called it
> > BugFixMultipleOutputs_0_20), which I manually patched with the fix.
> >
> > HTH,
> >
> > DR
> >
> 
> 
> 
> -- 
> Harsh J
 		 	   		  