hadoop-common-user mailing list archives

From Devaraj Das <d...@yahoo-inc.com>
Subject Re: combiner stats
Date Tue, 18 Nov 2008 14:18:59 GMT



On 11/18/08 6:36 PM, "Paco NATHAN" <ceteri@gmail.com> wrote:

> Thank you, Devaraj -
> That explanation helps a lot.
> 
> Is the following reasonable to say?
> 
>     Combine input records count shown in the Map phase column of the
> report is a measure of how many times records have passed through the
> Combiner during merges of intermediate spills. Therefore, it may be
> larger than the actual count of records which are being merged.
> 
> 

Yes, but to be precise you should say "sorts and merges" rather than just
"merges". As you may know, the map task sorts its output buffer whenever it
has collected enough data, and the records that get spilled to disk are the
ones the combiner outputs.
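
To make the double counting concrete, here is a toy model in Python. It is only a sketch under simplifying assumptions, not Hadoop's actual code path: the function names, the fixed `spill_size`, and the single merge pass over all spills are hypothetical. It counts every record handed to the combiner, once after each sort/spill and again during the merge.

```python
from collections import Counter


def combine(records):
    """Sum-style combiner: collapse (key, count) pairs that share a key."""
    totals = Counter()
    for key, count in records:
        totals[key] += count
    return sorted(totals.items())


def run_map_side(map_output, spill_size):
    """Toy map-side pipeline (assumption: one merge pass over all spills).

    Returns (combine_input_records, final_spill).
    """
    combine_input = 0
    spills = []

    # Sort each buffer-full of map output, run the combiner, then "spill" it.
    for start in range(0, len(map_output), spill_size):
        chunk = sorted(map_output[start:start + spill_size])
        combine_input += len(chunk)
        spills.append(combine(chunk))

    # Merge the spills; the combiner sees the already-combined records again.
    merged = [record for spill in spills for record in spill]
    combine_input += len(merged)
    final_spill = combine(merged)

    return combine_input, final_spill


map_output = [("a", 1), ("b", 1), ("a", 1), ("b", 1), ("a", 1), ("c", 1)]
combine_input, final_spill = run_map_side(map_output, spill_size=3)

print("map output records:   ", len(map_output))
print("combine input records:", combine_input)
print("final spill:          ", final_spill)
```

With six map output records and a buffer of three, the combiner sees 6 records at spill time and 5 more during the merge, so the toy "Combine input records" count is 11 even though only 3 records end up in the final spill.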

> Paco
> 
> 
>> On the map side, the combiner is called after sort and during the merges of
>> the intermediate spills. At the end a single spill file is generated. Note
>> that, during the merges, the same record may pass multiple times through the
>> combiner.
> 
> On Mon, Nov 17, 2008 at 23:04, Devaraj Das <ddas@yahoo-inc.com> wrote:
>> 
>> 
>> 
>> On 11/18/08 3:59 AM, "Paco NATHAN" <ceteri@gmail.com> wrote:
>> 
>>> Could someone please help explain the job counters shown for Combine
>>> records on the JobTracker JSP page?
>>> 
>>> Here's an example from one of our MR jobs.  There are Combine input
>>> and output record counters shown for both Map phase and Reduce phase.
>>> We're not quite sure how to interpret them -
>>> 
>>> Map Phase:
>>>    Map input records   85,013,261,279
>>>    Map output records   85,013,261,279
>>>    Combine input records   114,936,724,505
>>>    Combine output records   38,750,511,975
>>> 
>>> Reduce Phase:
>>>    Combine input records   8,827,017,275
>>>    Combine output records   17,986,654
>>>    Reduce input groups   2,221,796
>>>    Reduce input records   17,986,654
>>>    Reduce output records   4,443,590
>>> 
>>> 
>>> What makes sense:
>>>    * Considering the MR job and its data, the 85.0b count for Map
>>> output records is expected
>>>    * I would believe a rate of 85.0b / 38.8b = 2.2 for our combiner
>>>    * Reduce phase shows Combine output records at 18.0m = Reduce input
>>> records at 18.0m
>>>    * Reduce input groups at 2.2m is expected
>>>    * Reduce output records at 4.4m is verified
>>> 
>>> What doesn't make sense:
>>>    * The 115b count for Combine input records during Map phase
>>>    * The 8.8b count for Combine input records during Reduce phase
>>> 
>> 
>> On the map side, the combiner is called after sort and during the merges of
>> the intermediate spills. At the end a single spill file is generated. Note
>> that, during the merges, the same record may pass multiple times through the
>> combiner.
>> On the reducer side, the combiner is called only during merges of
>> intermediate data, and the intermediate merges stop at a certain point
>> (once <= io.sort.factor files remain). Hence the combiner may be called
>> fewer times here...
>> 
>>> What would be the actual count of records coming out of the Map phase?
>>> 
>>> Thanks,
>>> Paco
>> 
>> 
>> 
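
For reference, the ratios discussed in this thread can be recomputed from the counters quoted above. The short Python sketch below uses only the numbers from the original report; the dictionary keys and variable names are made up for illustration, and the ratios are descriptive only, since the Combine counters accumulate across every pass through the combiner.

```python
# Counter values copied from the job report quoted in this thread.
map_phase = {
    "map_output_records": 85_013_261_279,
    "combine_input_records": 114_936_724_505,
    "combine_output_records": 38_750_511_975,
}
reduce_phase = {
    "combine_input_records": 8_827_017_275,
    "combine_output_records": 17_986_654,
}

# How much larger the map-side Combine input count is than Map output,
# consistent with records being counted again during merges.
map_recount = map_phase["combine_input_records"] / map_phase["map_output_records"]

# The 85.0b / 38.8b rate mentioned in the original question.
map_rate = map_phase["map_output_records"] / map_phase["combine_output_records"]

# Reduce-side Combine input count relative to Combine output count.
reduce_rate = (
    reduce_phase["combine_input_records"] / reduce_phase["combine_output_records"]
)

print(f"map-side combine input / map output:       {map_recount:.3f}")
print(f"map output / map-side combine output:      {map_rate:.2f}")
print(f"reduce-side combine input / combine output: {reduce_rate:.1f}")
```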


