hadoop-common-user mailing list archives

From Gang Luo <lgpub...@yahoo.com.cn>
Subject Re: How are intermediate key/value pairs materialized between map and reduce?
Date Tue, 23 Feb 2010 13:39:10 GMT
Hi Tim,
the intermediate data is materialized to the local file system. Before it becomes available to the reducers,
the mappers sort it. If the sort buffer (io.sort.mb) is too small to hold the intermediate data,
multi-pass sorting happens, which means the same bytes are read and written more than once.
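
To put rough numbers on the multi-pass effect, here is a minimal sketch in Python. It is not Hadoop's actual spill/merge code: the function name is hypothetical, and the model assumes that every buffer fill spills one sorted run, that at most merge_factor runs (standing in for io.sort.factor) are merged at a time, and that every merge pass rewrites the whole map output. It ignores the spill threshold, record metadata, combiners, and compression.

```python
import math

def estimate_local_bytes_written(map_output_mb, sort_buffer_mb, merge_factor):
    """Rough estimate of local-disk MB written by one mapper (simplified model)."""
    # Each time the sort buffer fills, one sorted run is spilled to local disk.
    spills = max(1, math.ceil(map_output_mb / sort_buffer_mb))
    written = map_output_mb  # the initial spills write every byte once

    # Merge the runs, at most merge_factor at a time, until one file remains.
    # Assumption: every merge pass rewrites the whole map output.
    runs = spills
    while runs > 1:
        runs = math.ceil(runs / merge_factor)
        written += map_output_mb
    return written

# 1000 MB of map output, 100 MB buffer, merge factor 10:
# 10 spills plus one merge pass, so every byte is written twice.
print(estimate_local_bytes_written(1000, 100, 10))  # 2000
```

Under these assumptions, a mapper whose output fits in the buffer writes each byte once, while one that needs two merge passes writes each byte three times.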

Also, are you looking at the per-mapper statistics in the job tracker, or just at the
information printed when the job finishes? If you are looking at the information printed at the end
of the job, note that these are overall statistics, which include sorting on the reduce side.
They may also include the amount of data written to HDFS (I am not 100% sure).

And FILE_BYTES_WRITTEN has nothing to do with the replication factor. I think if you
change the factor to 6, FILE_BYTES_WRITTEN will still be the same.
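
The distinction can be sketched as a toy accounting model in Python. The function and key names are hypothetical, and the model only assumes what is said above: intermediate data lives on the local file system and is not replicated, while replication applies to the job output stored in HDFS. It does not claim to reproduce how Hadoop computes its counters.

```python
def disk_footprint(intermediate_bytes, job_output_bytes, replication):
    """Toy model: which on-disk quantities depend on the replication factor."""
    return {
        # Local spill and merge files are never replicated.
        "local_intermediate": intermediate_bytes,
        # Assumption: each HDFS block of the job output is stored `replication` times.
        "hdfs_physical": job_output_bytes * replication,
    }

with_3 = disk_footprint(900, 100, 3)
with_6 = disk_footprint(900, 100, 6)
print(with_3["local_intermediate"] == with_6["local_intermediate"])  # True
```

In this model, doubling the replication factor doubles only the HDFS figure; the local figure does not move.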


----- Original Message ----
From: Tim Kiefer <tim-kiefer@gmx.de>
To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
Sent: 2010/2/23 (Tue) 6:44:28 AM
Subject: How are intermediate key/value pairs materialized between map and reduce?

Hi there,

can anybody help me out with a (most likely) simple point of confusion?

I am wondering how intermediate key/value pairs are materialized. I have a job where the map
phase produces 600,000 records and map output bytes is ~300GB. What I thought (up to now)
is that these 600,000 records, i.e., 300GB, are materialized locally by the mappers and that
later on reducers pull these records (based on the key).
What I see (and cannot explain) is that the FILE_BYTES_WRITTEN counter is as high as ~900GB.

So - where does the factor 3 come from between Map output bytes and FILE_BYTES_WRITTEN???
I thought about the replication factor of 3 in the file system - but that should be HDFS only?!

- tim

