hadoop-common-user mailing list archives

From Gang Luo <lgpub...@yahoo.com.cn>
Subject Re: sort at reduce side
Date Wed, 03 Feb 2010 20:28:02 GMT
Thanks for the reply, Sriguru.
So, after the shuffle, are the spills at the reduce side actually stored as map files?

I ask these questions because of the following observations. On a 16-node cluster,
a map join takes three and a half minutes. A reduce-side join on nearly
the same amount of data takes 8 minutes before the map phase completes. I am sure the computation
(the map function) cannot cause so much difference, so the extra 4 minutes could only be spent
on sorting at the map side for the reduce-side join. I also notice that the sort time at the reduce
side is only 30 seconds (I cannot access the online jobtracker; the 30 seconds is actually the
time a reduce takes to go from 33% to 66% completeness). The number of reduce tasks
is much smaller than the number of map tasks, which means each reduce task sorts more data than each
map task (I use the hash partitioner and the data is uniformly distributed). The only explanation I can come
up with for the big difference between the sort at the map side and at the reduce side is that
these two sorts behave differently.

Does anybody have an idea why the map phase takes so much longer for a reduce-side join than for a map-side
join, and why there is such a big difference between the sort at the map side and at the reduce side?
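
To make the comparison concrete, here is a minimal Python sketch (not Hadoop code) of the two
join strategies as I understand them: a map-side join that loads the small input into an
in-memory hash table and streams the big input without sorting it, and a reduce-side join
that tags every record from both inputs and sorts all of them by key before joining. The
function names and the toy data are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def map_side_join(big, small):
    """Hash join: build a lookup table from the small input, stream the big one."""
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    # The big input is only scanned; it is never sorted.
    return [(key, b, s) for key, b in big for s in lookup.get(key, [])]


def reduce_side_join(big, small):
    """Tag records by source, sort everything by key, then join each key group."""
    tagged = [(key, "B", v) for key, v in big] + [(key, "S", v) for key, v in small]
    # This sort stands in for the map-side sort plus the shuffle.
    tagged.sort(key=itemgetter(0))
    joined = []
    for key, group in groupby(tagged, key=itemgetter(0)):
        records = list(group)
        bigs = [v for _, tag, v in records if tag == "B"]
        smalls = [v for _, tag, v in records if tag == "S"]
        joined.extend((key, b, s) for b in bigs for s in smalls)
    return joined


# Toy inputs: (key, value) pairs.
big = [(2, "b2"), (1, "b1"), (3, "b3"), (1, "b1x")]
small = [(1, "s1"), (2, "s2")]
```

Both functions return the same joined rows; the difference is that only the second one has to
sort every record of the big input, which is the extra work I suspect at the map side.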

P.S. I join a 7.5G file with a 100M file. The sort buffer at the reduce side is slightly larger than
that at the map side.


-Gang



----- Original Message ----
From: Srigurunath Chakravarthi <sriguru@yahoo-inc.com>
To: "common-user@hadoop.apache.org" <common-user@hadoop.apache.org>
Sent: 2010/2/3 (Wed) 12:50:08 AM
Subject: RE: sort at reduce side

Hi Gang,

>kept in a map file. If so, in order to sort the data efficiently, the reducer
>actually reads only the index part of each spill (which is a map file) and
>sorts the keys, instead of reading whole records from disk and sorting them.

AFAIK, no. Reduces always fetch map output data, not indexes (even if the data is from
the local node, where an index might be sufficient).
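
As a rough illustration (a Python sketch, not the actual Hadoop implementation): each fetched
map output is already sorted by key, so the reduce side can merge whole records from the sorted
segments instead of sorting them from scratch. The segment contents and the function name
below are made up for the example.

```python
import heapq
from operator import itemgetter


def merge_sorted_segments(segments):
    """K-way merge of segments that are each already sorted by key.

    Every element is a full (key, value) record, not just an index entry.
    """
    return list(heapq.merge(*segments, key=itemgetter(0)))


# Hypothetical map outputs fetched by one reduce, each sorted by key.
segments = [
    [(1, "a"), (4, "d")],
    [(2, "b"), (3, "c")],
    [(1, "e"), (5, "f")],
]
merged = merge_sorted_segments(segments)
```

A merge of pre-sorted runs does less work than a full sort of the same records, which may be
part of why the reduce-side sort phase looks short.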

Regards,
Sriguru

>-----Original Message-----
>From: Gang Luo [mailto:lgpublic@yahoo.com.cn]
>Sent: Wednesday, February 03, 2010 10:40 AM
>To: common-user@hadoop.apache.org
>Subject: sort at reduce side
>
>Hi all,
>I want to know some more details about the sorting at the reduce side.
>
>The intermediate result generated at the map side is stored as a map file,
>which actually consists of two sub-files, namely an index file and a data file.
>The index file stores the keys and points to the corresponding records
>stored in the data file. What I think is that when the intermediate result
>(even only part of it for each mapper) is shuffled to the reducer, it is still
>kept in a map file. If so, in order to sort the data efficiently, the reducer
>actually reads only the index part of each spill (which is a map file) and
>sorts the keys, instead of reading whole records from disk and sorting them.
>
>Does the reducer actually do what I expect?
>
>-Gang
>
>

