hadoop-common-user mailing list archives

From Jeff Zhang <zjf...@gmail.com>
Subject Re: Performance tuning of sort
Date Fri, 18 Jun 2010 05:34:31 GMT
How much data each reducer receives depends on the Partitioner. You can
think of the Partitioner as a hash function and each reducer as a bucket,
so you cannot expect every bucket to hold the same number of items.

A skewed data distribution will make a few reducers take much more time
than the rest.
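Jeff's analogy can be made concrete with a minimal sketch in plain Java (an illustration of hash-style bucketing, not Hadoop's actual HashPartitioner class): each key is assigned to bucket `hash(key) mod numReducers`, so a skewed key distribution piles records into one bucket:

```java
import java.util.Arrays;

public class SkewDemo {
    // Hash-style partitioning: bucket = hash(key) mod numReducers.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // Five "a" records versus one each of "b", "c", "d" -- skewed input.
        String[] keys = {"a", "a", "a", "a", "a", "b", "c", "d"};
        int numReducers = 4;
        int[] counts = new int[numReducers];
        for (String k : keys) {
            counts[partition(k, numReducers)]++;
        }
        // One bucket receives five records while the others get at most one,
        // so the reducer behind that bucket does most of the work.
        System.out.println(Arrays.toString(counts));
    }
}
```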



2010/6/18 李钰 <carp84@gmail.com>:
> Hi Jeff and Amogh,
>
> Thanks for your comments! In my understanding, in the partitioning phase
> before spilling to disk, the threads divide the data into partitions
> corresponding to the number of reducers, as described in the Definitive
> Guide. So I think the scale of the input data should be the same for each
> reducer. Please correct me if there is any misunderstanding here, thanks.
>
> As to the reduce phases, I did check the time of shuffle, sort and reduce
> through the JT UI, but found it quite different for each reduce task. Some
> tasks have a longer shuffle time but less reduce time, while others have
> less shuffle time but a longer reduce time. I set the reducer number large
> enough to let all reduce tasks run in parallel, and set the
> "mapred.reduce.slowstart.completed.maps" parameter to 1.0 so that they all
> start at the same time, once all map tasks have finished; I think this
> should reduce the impact of the network and of waiting for map tasks to
> finish during the shuffle phase. Then why is the time spent in shuffle
> still so different? And since the reduce phase of a reduce task just
> writes sorted data into HDFS, why does the time of the reduce phase differ?
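(For reference, the slow-start property Carp mentions is normally set in the job or site configuration; a minimal fragment, assuming the pre-0.21 property name used in this thread:)

```xml
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <!-- 1.0 = do not launch reduce tasks until every map task has finished -->
  <value>1.0</value>
</property>
```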
>
> Is there anything wrong with my analysis? Any suggestions? Thanks a lot.
>
> Dear all,
>
> Any other comments? Thanks.
>
> Best Regards,
> Carp
>
>
> On June 18, 2010 at 11:39 AM, Amogh Vasekar <amogh@yahoo-inc.com> wrote:
>
>>
>> >>Since the scale of input data and operations of each reduce task is the
>> same, what may cause the execution time of reduce tasks different?
>>
>> You should consider looking at the copy, shuffle and reduce times
>> separately in the JT UI to get better information. Many (dynamic) factors,
>> such as network congestion, the number of mappers each reducer is fetching
>> from, data skew w.r.t. the input keys per reducer, etc., will affect these
>> numbers.
>>
>> HTH,
>> Amogh
>>
>> On 6/18/10 8:05 AM, "李钰" <carp84@gmail.com> wrote:
>>
>> Hi Todd and Jeff,
>>
>> Thanks a lot for your discussion, it's really helpful to me. I'd like to
>> express my special appreciation for Todd's patient explanation, which
>> helped me see the working mechanism of Sort more clearly. And Jeff, thank
>> you for reminding me that Sort uses TotalOrderPartitioner to do the
>> partitioning.
>> Based on your discussion I update my understanding as follows:
>> the sorting happens on the map side during the spill process of each map
>> task; after that, the map outputs are partitioned by the
>> TotalOrderPartitioner, which decides the input range of each reducer.
>> Reducers fetch map outputs as decided by the partitioner, merge them, and
>> write the results into HDFS.
>> Is this understanding right? Please correct me if you find any faults,
>> thanks.
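Carp's summary can be illustrated with a toy range partitioner in plain Java (not Hadoop's actual TotalOrderPartitioner, which derives its cut points by sampling the input): keys below the first cut point go to reducer 0, keys between the cut points to reducer 1, and so on, so concatenating the sorted reducer outputs yields a totally ordered result:

```java
public class RangePartitionDemo {
    // Cut points would be chosen by sampling in a real job; hard-coded here.
    // Three reducers: keys < "h", keys in ["h", "p"), keys >= "p".
    static final String[] CUTS = {"h", "p"};

    // Return the index of the first cut point greater than the key.
    static int partition(String key) {
        int i = 0;
        while (i < CUTS.length && key.compareTo(CUTS[i]) >= 0) i++;
        return i;
    }

    public static void main(String[] args) {
        String[] keys = {"zebra", "apple", "mango", "kiwi", "banana"};
        for (String k : keys) {
            System.out.println(k + " -> reducer " + partition(k));
        }
    }
}
```

Because reducer 0's keys all sort before reducer 1's, each reducer can sort (or merge) locally and the concatenated outputs are globally sorted.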
>> If this understanding is right, then my question rolls back to the
>> original one: since the scale of the input data and the operations of each
>> reduce task are the same, what may cause the execution times of the reduce
>> tasks to differ? All nodes used in my experiment are on the same rack, and
>> they are homogeneous.
>> Any suggestion will be highly appreciated, thanks.
>>
>> Best Regards,
>> Carp
>>
>> 2010/6/18 Todd Lipcon <todd@cloudera.com>
>>
>> > On Thu, Jun 17, 2010 at 9:37 AM, Jeff Zhang <zjffdu@gmail.com> wrote:
>> >
>> > > Todd,
>> > >
>> > > Why is there a sort in the map task? The sorting here seems useless
>> > > in my opinion.
>> > >
>> > >
>> > For map-only jobs there isn't. For jobs with a reduce phase, the number
>> > of reduce tasks is typically smaller than the number of map tasks, so
>> > parallelizing the sort on the mappers and just doing a merge on the
>> > reducers is beneficial. Second, this allows the combiner to run on the
>> > mapper by identifying when it has multiple outputs for the same key.
>> > Third, this allows improved compression of the map output (thus less
>> > intermediate data transfer) by putting similar keys near each other
>> > (hopefully within the compression window). Fourth, it kills two birds
>> > with one stone, since the mappers already have to group outputs by
>> > partition.
>> >
>> > -Todd
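Todd's fourth point, that the spill sort keys on a <partition, key> tuple, can be sketched in plain Java (an illustration, not Hadoop's actual spill code): sorting the buffer by partition first and key second makes each partition's records contiguous and internally sorted in a single pass, which is exactly what the combiner and the per-reducer spill segments need:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SpillSortDemo {
    // One map output record: (partition, key); values omitted for brevity.
    static class Rec {
        final int partition;
        final String key;
        Rec(int partition, String key) { this.partition = partition; this.key = key; }
        @Override public String toString() { return "(" + partition + "," + key + ")"; }
    }

    // The spill sort Todd describes: primary key = partition, secondary = key.
    static List<Rec> spillSort(List<Rec> buffer) {
        buffer.sort(Comparator.comparingInt((Rec r) -> r.partition)
                              .thenComparing(r -> r.key));
        return buffer;
    }

    public static void main(String[] args) {
        List<Rec> buffer = new ArrayList<>(Arrays.asList(
            new Rec(1, "b"), new Rec(0, "z"), new Rec(1, "a"), new Rec(0, "a")));
        // After sorting, partition 0's records come first (sorted by key),
        // then partition 1's: one pass yields contiguous per-reducer segments,
        // and records with equal keys end up adjacent for the combiner.
        System.out.println(spillSort(buffer));
    }
}
```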
>> >
>> >
>> > >
>> > >
>> > > On Thu, Jun 17, 2010 at 9:26 AM, Todd Lipcon <todd@cloudera.com>
>> wrote:
>> > > > On Thu, Jun 17, 2010 at 12:43 AM, Jeff Zhang <zjffdu@gmail.com>
>> wrote:
>> > > >
>> > > >> Your understanding of Sort is not right. The key concept of Sort
>> > > >> is the TotalOrderPartitioner. Before the map-reduce job runs, the
>> > > >> client side samples the input data to estimate its distribution.
>> > > >> The mapper does nothing; each reducer fetches its data according
>> > > >> to the TotalOrderPartitioner. The data in each reducer is locally
>> > > >> sorted, and the reducers themselves are ordered (r0 < r1 < r2 ...),
>> > > >> so the overall result is sorted.
>> > > >>
>> > > >
>> > > > The sorting happens on the map side, actually, during the spill
>> > > > process. The mapper itself is an identity function, but the map
>> > > > task code does perform a sort (on a <partition, key> tuple) as
>> > > > originally described in this thread. Reducers just do a merge of
>> > > > the mapper outputs.
>> > > >
>> > > > -Todd
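The reducer-side merge Todd mentions can be sketched as a k-way merge of already-sorted runs in plain Java (an illustration of the technique, not Hadoop's actual merge code): a priority queue always yields the smallest head element among the runs, so no full re-sort is needed:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class MergeDemo {
    // Each mapper delivers an already-sorted run; the reducer merges them.
    // Heap entries are {runIndex, positionInRun}, ordered by current element.
    static List<String> merge(List<List<String>> runs) {
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            Comparator.comparing((int[] e) -> runs.get(e[0]).get(e[1])));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) heap.add(new int[]{i, 0});
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] e = heap.poll();
            out.add(runs.get(e[0]).get(e[1]));
            // Advance within the run the element came from.
            if (e[1] + 1 < runs.get(e[0]).size()) {
                heap.add(new int[]{e[0], e[1] + 1});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> merged = merge(List.of(
            List.of("a", "d"), List.of("b", "c"), List.of("e")));
        System.out.println(merged);  // the sorted union of the three runs
    }
}
```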
>> > > >
>> > > >
>> > > >>
>> > > >>
>> > > >>
>> > > >> On Thu, Jun 17, 2010 at 12:13 AM, 李钰 <carp84@gmail.com> wrote:
>> > > >> > Hi all,
>> > > >> >
>> > > >> > I'm doing some tuning of the sort benchmark of Hadoop. To be
>> > > >> > more specific, I am running tests against the
>> > > >> > org.apache.hadoop.examples.Sort class. Looking through the
>> > > >> > source code, I think the map tasks take responsibility for
>> > > >> > sorting the input data, and the reduce tasks just merge the map
>> > > >> > outputs and write them into HDFS. But here I've got a question I
>> > > >> > couldn't understand: the time cost of the reduce phase of each
>> > > >> > reduce task, that is, writing data into HDFS, differs from task
>> > > >> > to task. Since the input data and operations of each reduce task
>> > > >> > are the same, what could cause the execution times to differ? Is
>> > > >> > there anything wrong with my understanding? Does anybody have
>> > > >> > any experience with this? I badly need your help, thanks.
>> > > >> >
>> > > >> > Best Regards,
>> > > >> > Carp
>> > > >>
>> > > >>
>> > > >>
>> > > >> --
>> > > >> Best Regards
>> > > >>
>> > > >> Jeff Zhang
>> > > >>
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > Todd Lipcon
>> > > > Software Engineer, Cloudera
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Best Regards
>> > >
>> > > Jeff Zhang
>> > >
>> >
>> >
>> >
>> > --
>> >  Todd Lipcon
>> > Software Engineer, Cloudera
>> >
>>
>>
>



-- 
Best Regards

Jeff Zhang
