hadoop-mapreduce-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: problem when using combiner and MultipleOutputFormat
Date Fri, 28 Oct 2011 13:44:37 GMT
Xin,

No problem. Glad to know you managed to hunt it down and fix it! :)

On 28-Oct-2011, at 6:11 PM, Xin Jing wrote:

> Hi Harsh,
> 
> You are right. I went through my combiner code again carefully and found a bug
> in it. In short, I attach a type to each value and check that type before
> processing. The type is not the same before and after multiple combiner
> iterations, so...
> 
> I found the root cause of the issue, thanks for your help, Harsh.
> ________________________________________
> From: Xin Jing [xinjing@beyondfun.net]
> Sent: Friday, October 28, 2011 2:43 PM
> To: mapreduce-user@hadoop.apache.org
> Subject: RE: problem when using combiner and MultipleOutputFormat
> 
> OK, I will prepare my code to show how it works.
> 
> Here is another question: my combiner DOES output some records. If the reason is,
> as you said, that my combiner behaves wrongly when run multiple times rather than
> once, is the reduce receiving no input because of the wrong data format coming out
> of the combiner?
> 
> I attach the job result, please take a look.
> ________________________________________
> From: Harsh J [harsh@cloudera.com]
> Sent: Friday, October 28, 2011 2:33 PM
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: problem when using combiner and MultipleOutputFormat
> 
> Xin,
> 
> I guess I am now confused. I was thinking that your combiner was a
> reused reducer, and that was probably eating away all the outputs when
> run (Combiners run 0…N times (yes, 0 is possible), so that may explain
> your low vs. high data volume difference).
> 
> I'm sure the issue is with your combiner implementation, but I'd have
> to see some mock code to tell you more, as I am unsure of what you are
> attempting to do in there.
> 
> On Fri, Oct 28, 2011 at 11:52 AM, Xin Jing <xinjing@beyondfun.net> wrote:
>> Thanks for your answer.
>> 
>> I am using a different combiner and reducer. As I said in my previous mail, when
>> the data set is small, it works fine and the result is correct. So I can tell the
>> functionality of my job is OK, right?
>> 
>> I do not understand what you mean by 'Do not output to files directly from your
>> combiner'. Could you give me more hints? In my combiner code, I am using
>> output.collect() to output my result. Am I misusing it?
>> ________________________________________
>> From: Harsh J [harsh@cloudera.com]
>> Sent: Friday, October 28, 2011 2:11 PM
>> To: mapreduce-user@hadoop.apache.org
>> Subject: Re: problem when using combiner and MultipleOutputFormat
>> 
>> Xin,
>> 
>> You probably just need to write a special Combiner class instead of
>> reusing your Reducer class for combiner purposes. In an MR job, you
>> need to specifically guarantee that the combiner outputs the same type
>> of K-V pairs as the reducer's input. Do not output to files directly
>> from your combiner, and that is why you'd need a different class impl.
>> performing the optimization.
>> 
>> On Fri, Oct 28, 2011 at 10:04 AM, Xin Jing <xinjing@beyondfun.net> wrote:
>>> 
>>> Hi all,
>>> I am currently facing a tough problem. My job uses MultipleOutputFormat
>>> to write results into different folders, and I have to use a combiner to
>>> improve performance. In this situation, reduce does not work: it does not
>>> receive any data. I searched for this issue and found a related
>>> topic, http://lucene.472066.n3.nabble.com/Combiner-and-MultipleOutputs-in-Mapreduce-td1640503.html,
>>> but it is not clear to me what the solution really is. It seems to be a
>>> constraint of the Hadoop framework?
>>> I found an interesting phenomenon: when I limit the map input to a
>>> small number of records (such as 10000), the reduce is OK; it receives data and
>>> the result is correct. But when the input is over a million records, the reduce
>>> receives nothing. I guess the reason is that the combiner is called only once when
>>> the data is small, while it is called multiple times when the data is huge.
>>> To summarize, how can I make a combiner work while using
>>> MultipleOutputFormat? Any solution or suggestion is welcome.
>>> 
>>> Thanks
>>> 
>> 
>> 
>> 
>> --
>> Harsh J
>> 
>> 
>> 
> 
> 
> 
> --
> Harsh J
> 
> 
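
For anyone landing on this thread later: the actual combiner code was never posted, so the following is only a guess at the class of bug described above, not Xin's implementation. It is a plain-Java simulation with no Hadoop classes, and every name in it (Tagged, the "RAW" and "PARTIAL" tags, buggyCombine, safeCombine, runJob) is hypothetical. It assumes a combiner that checks a type tag on each value, and shows why such a combiner must accept its own output, since it may run 0, 1, or several times.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

public class CombinerContractSketch {

    /** Hypothetical tagged value; the tag names are made up for this sketch. */
    static final class Tagged {
        final String tag;
        final long count;

        Tagged(String tag, long count) {
            this.tag = tag;
            this.count = count;
        }
    }

    /** Buggy: only processes "RAW" values but emits a "PARTIAL" value. */
    static List<Tagged> buggyCombine(List<Tagged> values) {
        long sum = 0;
        boolean sawRaw = false;
        for (Tagged v : values) {
            if (v.tag.equals("RAW")) { // values from an earlier pass are skipped here
                sum += v.count;
                sawRaw = true;
            }
        }
        List<Tagged> out = new ArrayList<>();
        if (sawRaw) {
            out.add(new Tagged("PARTIAL", sum));
        }
        return out;
    }

    /** Safe: output has the same tag as the input, so a second pass still works. */
    static List<Tagged> safeCombine(List<Tagged> values) {
        long sum = 0;
        for (Tagged v : values) {
            sum += v.count;
        }
        List<Tagged> out = new ArrayList<>();
        if (!values.isEmpty()) {
            out.add(new Tagged("RAW", sum));
        }
        return out;
    }

    /** Stand-in reducer: sums whatever reaches it for one key. */
    static long reduce(List<Tagged> values) {
        long sum = 0;
        for (Tagged v : values) {
            sum += v.count;
        }
        return sum;
    }

    /** Applies the combiner the given number of times, then reduces. */
    static long runJob(UnaryOperator<List<Tagged>> combiner, List<Tagged> mapOutput, int passes) {
        List<Tagged> current = mapOutput;
        for (int i = 0; i < passes; i++) {
            current = combiner.apply(current);
        }
        return reduce(current);
    }

    /** Three mapper outputs for a single key, totalling 6. */
    static List<Tagged> sampleMapOutput() {
        return Arrays.asList(new Tagged("RAW", 1), new Tagged("RAW", 2), new Tagged("RAW", 3));
    }

    public static void main(String[] args) {
        for (int passes = 0; passes <= 2; passes++) {
            long buggy = runJob(CombinerContractSketch::buggyCombine, sampleMapOutput(), passes);
            long safe = runJob(CombinerContractSketch::safeCombine, sampleMapOutput(), passes);
            System.out.println(passes + " pass(es): buggy=" + buggy + " safe=" + safe);
        }
    }
}
```

In this sketch both combiners give a total of 6 with zero or one pass, but with two passes the buggy one hands the reducer nothing (total 0), while the safe one still gives 6. That is at least consistent with the symptom in the thread: correct results on small input, an empty reduce on large input.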

