hadoop-mapreduce-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: problem when using combiner and MultipleOutputFormat
Date Fri, 28 Oct 2011 06:33:07 GMT

I guess I am now confused. I was thinking that your combiner was a
reused reducer, and that it was probably eating away all the outputs
when run. Combiners run 0…N times (yes, 0 is possible), so that may
explain your low vs. high data volume difference.

I'm sure the issue is with your combiner implementation, but I'd have
to see some mock code to tell you more, as I am unsure what you are
attempting to do in there.
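
To illustrate the 0…N point, here is a minimal sketch in plain Java. It
does not use any Hadoop classes; CombinerSketch, combine(), reduce(),
run() and sampleMapOutput() are hypothetical names made up for this
example, and the data is invented. It assumes a simple sum-style job.
The idea is only that a combiner which emits the same key/value types
the reducer takes as input can be applied zero, one, or several times
without changing the final result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation only; every name here is hypothetical, not Hadoop API.
class CombinerSketch {

    // Combiner: takes (key, list of counts) and emits (key, [partial sum]).
    // Its output has the same shape as its input, so it can be re-applied.
    static Map<String, List<Long>> combine(Map<String, List<Long>> grouped) {
        Map<String, List<Long>> out = new TreeMap<>();
        for (Map.Entry<String, List<Long>> e : grouped.entrySet()) {
            long sum = 0;
            for (long v : e.getValue()) {
                sum += v;
            }
            List<Long> partial = new ArrayList<>();
            partial.add(sum);
            out.put(e.getKey(), partial);
        }
        return out;
    }

    // Reducer: produces the final total per key.
    static Map<String, Long> reduce(Map<String, List<Long>> grouped) {
        Map<String, Long> out = new TreeMap<>();
        for (Map.Entry<String, List<Long>> e : grouped.entrySet()) {
            long sum = 0;
            for (long v : e.getValue()) {
                sum += v;
            }
            out.put(e.getKey(), sum);
        }
        return out;
    }

    // Applies the combiner `passes` times (0 is allowed), then reduces.
    static Map<String, Long> run(Map<String, List<Long>> mapOutput, int passes) {
        Map<String, List<Long>> data = mapOutput;
        for (int i = 0; i < passes; i++) {
            data = combine(data);
        }
        return reduce(data);
    }

    // Invented map output: key "a" seen three times, key "b" twice.
    static Map<String, List<Long>> sampleMapOutput() {
        Map<String, List<Long>> m = new TreeMap<>();
        m.put("a", new ArrayList<>(List.of(1L, 1L, 1L)));
        m.put("b", new ArrayList<>(List.of(1L, 1L)));
        return m;
    }

    public static void main(String[] args) {
        for (int passes = 0; passes <= 3; passes++) {
            System.out.println(passes + " combiner pass(es): "
                    + run(sampleMapOutput(), passes));
        }
    }
}
```

A combiner that instead writes its results somewhere else, or emits a
different type than the reducer expects, would not survive this kind of
repeated application, which is consistent with a job that works on
small input and breaks on large input.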

On Fri, Oct 28, 2011 at 11:52 AM, Xin Jing <xinjing@beyondfun.net> wrote:
> Thanks for your answer.
> I am using a different combiner and reducer. As I said in my previous mail, when the
> data set is small, it works fine and the result is correct. So I can tell the functionality of
> my job is ok, right?
> I cannot understand what you mean by 'Do not output to files directly from your combiner',
> could you give me more hints? In my combiner code, I am using output.collect() to output my result,
> am I misusing it?
> ________________________________________
> From: Harsh J [harsh@cloudera.com]
> Sent: Friday, October 28, 2011 2:11 PM
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: problem when using combiner and MultipleOutputFormat
> Xin,
> You probably just need to write a special Combiner class instead of
> reusing your Reducer class for combiner purposes. In an MR job, you
> need to specifically guarantee that the combiner outputs the same type
> of K-V pairs as the reducer's input. Do not output to files directly
> from your combiner, and that is why you'd need a different class impl.
> performing the optimization.
> On Fri, Oct 28, 2011 at 10:04 AM, Xin Jing <xinjing@beyondfun.net> wrote:
>> Hi all,
>> I am currently encountering a tough problem. My job uses MultipleOutputFormat
>> to output results into different folders, and I have to use a combiner to
>> enhance performance. In this situation, reduce does not work: the reducer
>> does not receive any data. I searched for this issue and found a related
>> topic, http://lucene.472066.n3.nabble.com/Combiner-and-MultipleOutputs-in-Mapreduce-td1640503.html
>> but it is not clear to me what the solution really is. Is it a constraint of
>> the hadoop framework?
>> I found an interesting phenomenon: when I limit the map input records to a
>> small number (such as 10000), the reduce is ok, it receives data and the
>> result is correct. But when the input is over a million records, the reduce
>> receives nothing. I guess the reason is that the combiner is only called once when
>> the data is small, while the combiner is called multiple times when the data is huge.
>> To summarize, how can I make a combiner feasible while using
>> MultipleOutputFormat? Any solution or suggestion is welcome.
>> Thanks
> --
> Harsh J

Harsh J
