hadoop-mapreduce-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Problem when using MultipleOutputs with many files
Date Fri, 02 Sep 2011 17:33:09 GMT
Hello David,

The fix for MAPREDUCE-1853 was already back-ported into CDH3u0 [1], so
that bug should not be the cause of Panagiotis's slowdown.

P.S. Please do not upgrade to the 0.21.x series in production, as it is
not deemed stable yet. This is noted on the Apache Hadoop website as well.

[1] - http://archive.cloudera.com/cdh/3/hadoop-0.20.2+923.21.releasenotes.html

On Fri, Sep 2, 2011 at 8:39 PM, David Rosenstrauch <darose@darose.net> wrote:
> On 09/02/2011 09:14 AM, Panagiotis Antonopoulos wrote:
>> Hello guys,
>> I am using hadoop-0.20.2-cdh3u0 and I use MultipleOutputs to divide the
>> HFiles (which are the output of my MR job) so that each file can fit into
>> one region of the table where I am going to bulk load them.
>> Therefore I have one MultipleOutput per region and as a result I had 280
>> different outputs.
>> I just realized that using so many outputs makes my job a lot slower than
>> it is when I have just one output.
>> Do you know what goes wrong? Has anyone noticed the same?
>> Thank you!
>> Panagiotis
> You're probably running into this bug, which crushes the performance of
> MultipleOutputs:
> https://issues.apache.org/jira/browse/MAPREDUCE-1853
> Apparently it's fixed in v0.21, so try to upgrade if you can.
> I wasn't able to in our code however (we were also using Cloudera CDH, which
> as you see is 0.20).  What I eventually wound up doing to work around it was
> to use our own local copy of the MultipleOutputs class (I called it
> BugFixMultipleOutputs_0_20) which I manually patched with the fix.
> HTH,
> DR
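
For readers who have not looked at the patch: as far as I understand MAPREDUCE-1853, the problem was that MultipleOutputs rebuilt a per-output context on every write instead of caching it, and the fix keeps one context per named output. The sketch below only illustrates that caching idea in plain Java. It is not the actual Hadoop patch, and ContextCacheSketch, FakeContext, getContext and buildCount are hypothetical names made up for this example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; none of these names are Hadoop classes. */
public class ContextCacheSketch {

    /** Stand-in for a per-output context that is expensive to construct. */
    static final class FakeContext {
        final String namedOutput;

        FakeContext(String namedOutput) {
            this.namedOutput = namedOutput;
        }
    }

    // One cached context per named output, created lazily on first use.
    private final Map<String, FakeContext> cache = new HashMap<>();

    // Counts how many times the (assumed expensive) construction actually ran.
    private int builds = 0;

    /** Returns the cached context for this output, building it only once. */
    FakeContext getContext(String namedOutput) {
        FakeContext ctx = cache.get(namedOutput);
        if (ctx == null) {
            builds++;
            ctx = new FakeContext(namedOutput);
            cache.put(namedOutput, ctx);
        }
        return ctx;
    }

    int buildCount() {
        return builds;
    }

    public static void main(String[] args) {
        ContextCacheSketch sketch = new ContextCacheSketch();
        // Simulate 100,000 records spread over 280 named outputs.
        for (int record = 0; record < 100_000; record++) {
            sketch.getContext("region" + (record % 280));
        }
        // With caching, construction runs once per output, not once per record.
        System.out.println("contexts built: " + sketch.buildCount());
    }
}
```

Without the cache, the construction step would run once per record rather than once per output, which is the kind of per-write overhead the bug report describes.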

Harsh J
