hadoop-common-user mailing list archives

From Ed Mazur <ma...@cs.umass.edu>
Subject Re: How are intermediate key/value pairs materialized between map and reduce?
Date Tue, 23 Feb 2010 15:11:31 GMT

Hi Tim,

I'm guessing a lot of these writes are happening on the reduce side.
On the JobTracker web interface, there are three columns: map, reduce,
and overall. Is the 900GB figure from the overall column? The value in
the map column will probably be closer to what you were expecting,
because the reduce side also writes to local disk during the shuffle
and multi-pass merge.
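
For the map side, a rough back-of-the-envelope model may help show how
multi-pass merging alone could account for a factor of about 3. The sketch
below is only an illustration, not Hadoop's code. The function name, the
80 MB spill size (io.sort.mb = 100 with an assumed 80% spill threshold), the
rule for sizing the first merge pass, and the idea that a single spill is
reused as the final output are all assumptions of this sketch. It uses the
figures quoted below: 40 mappers, ~300GB of map output, io.sort.factor = 10.

```python
import heapq
import math


def estimate_map_side_writes(output_mb, spill_mb, factor):
    """Estimate local-disk writes for ONE map task (simplified model).

    output_mb: map output of the task in MB (integer)
    spill_mb:  assumed size of each spill file in MB (integer)
    factor:    maximum number of segments merged at once (io.sort.factor)
    Returns (number_of_spills, total_mb_written).
    """
    if factor < 2:
        raise ValueError("factor must be at least 2")

    num_spills = math.ceil(output_mb / spill_mb)

    # Every record is written once when the sort buffer spills to disk.
    written = output_mb
    if num_spills <= 1:
        # Assumption: a single spill is reused as the final output, no merge.
        return num_spills, written

    # Segment sizes: full spills plus one possibly partial last spill.
    segments = [spill_mb] * (output_mb // spill_mb)
    if output_mb % spill_mb:
        segments.append(output_mb % spill_mb)
    heapq.heapify(segments)

    # Intermediate passes: merge the smallest segments until at most
    # `factor` segments remain. Assumption: the first pass is narrower so
    # that every later pass merges exactly `factor` segments.
    first_pass = True
    while len(segments) > factor:
        width = factor
        if first_pass:
            mod = (len(segments) - 1) % (factor - 1)
            if mod:
                width = mod + 1
            first_pass = False
        merged = sum(heapq.heappop(segments) for _ in range(width))
        written += merged  # the merged run is written back to local disk
        heapq.heappush(segments, merged)

    # The final merge writes the complete map output once more.
    written += output_mb
    return num_spills, written


# 300GB over 40 mappers is 7680 MB per mapper; 80 MB per spill is assumed.
per_mapper_mb = 300 * 1024 // 40
spills, written_mb = estimate_map_side_writes(per_mapper_mb, 80, 10)
print(spills, written_mb, written_mb / per_mapper_mb)  # 96 23040 3.0
```

Under these assumptions each mapper produces 96 spills, which is more than
io.sort.factor, so the data is written three times: once as spills, once in
intermediate merges, and once in the final merge. Real numbers will differ
with record sizes, combiners, and compression, so treat this as a
plausibility check only.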

Ed

2010/2/23 Tim Kiefer <tim-kiefer@gmx.de>:
> Hi Gang,
>
> thanks for your reply.
>
> To clarify: I look at the statistics through the job tracker. In the
> web interface for my job I have columns for map, reduce and total. What I
> was referring to is "map" - i.e. I see FILE_BYTES_WRITTEN = 3 * Map
> Output Bytes in the map column.
>
> About the replication factor: I would expect the exact same thing -
> changing to 6 has no influence on FILE_BYTES_WRITTEN.
>
> About the sorting: I have io.sort.mb = 100 and io.sort.factor = 10.
> Furthermore, I have 40 mappers and map output data is ~300GB. I can't
> see how that ends up as a factor of 3.
>
> - tim
>
> On 23.02.2010 14:39, Gang Luo wrote:
>> Hi Tim,
>> the intermediate data is materialized to the local file system. Before it is
>> available to reducers, the mappers sort it. If the buffer (io.sort.mb) is too
>> small for the intermediate data, multi-phase sorting happens, which means you
>> read and write the same data more than once.
>>
>> Besides, are you looking at the per-mapper statistics through the job tracker,
>> or just at the information printed when a job finishes? If you look at the
>> information given at the end of the job, note that these are overall
>> statistics, which include sorting on the reduce side. They also include the
>> amount of data written to HDFS (I am not 100% sure).
>>
>> And FILE_BYTES_WRITTEN has nothing to do with the replication factor. I think
>> that if you change the factor to 6, FILE_BYTES_WRITTEN stays the same.
>>
>>  -Gang
>>
>>
>> ----- Original Message ----
>> From: Tim Kiefer <tim-kiefer@gmx.de>
>> To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
>> Sent: 2010/2/23 (Tue) 6:44:28 AM
>> Subject: How are intermediate key/value pairs materialized between map and reduce?
>>
>> Hi there,
>>
>> can anybody help me out with a (most likely) simple point of confusion?
>>
>> I am wondering how intermediate key/value pairs are materialized. I have a job
>> where the map phase produces 600,000 records and map output bytes is ~300GB.
>> What I thought (up to now) is that these 600,000 records, i.e., 300GB, are
>> materialized locally by the mappers and that later on reducers pull these
>> records (based on the key).
>> What I see (and cannot explain) is that the FILE_BYTES_WRITTEN counter is as
>> high as ~900GB.
>>
>> So where does the factor of 3 between Map output bytes and FILE_BYTES_WRITTEN
>> come from? I thought about the replication factor of 3 in the file system, but
>> that should be HDFS only?!
>>
>> Thanks
>> - tim