hadoop-common-user mailing list archives

From "Stu Hood" <stuh...@mailtrust.com>
Subject RE: Performance difference over two map-reduce solutions of same problem in different cluster sizes
Date Thu, 15 May 2008 12:22:07 GMT
I don't have an answer to that question, but if you are going to move forward
with your optimized solution, you can probably improve performance considerably by using
GenericWritable: http://hadoop.apache.org/core/docs/r0.15.3/api/org/apache/hadoop/io/GenericWritable.html

Rather than outputting three values, wrap a single value in a GenericWritable. That way the
overhead is only a single byte per value.
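
To illustrate the "single byte per value" point, here is a minimal sketch of a tagged wrapper. It uses only java.io and is not the Hadoop API; the class TaggedValue and its three tags are hypothetical stand-ins for the three varieties of data in this thread. Each serialized value is one type-tag byte followed by the payload, which is the general idea behind wrapping values in a single type.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical wrapper, NOT a Hadoop class: one value type for all varieties,
// serialized as a 1-byte tag followed by the wrapped payload.
class TaggedValue {
    // Assumed tags, one per variety of data.
    static final byte INT = 0;
    static final byte LONG = 1;
    static final byte DOUBLE = 2;

    final byte tag;
    final Number value;

    TaggedValue(byte tag, Number value) {
        this.tag = tag;
        this.value = value;
    }

    // Writes the tag byte, then the payload in the format the tag implies.
    byte[] toBytes() {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeByte(tag);
            switch (tag) {
                case INT:    out.writeInt(value.intValue()); break;
                case LONG:   out.writeLong(value.longValue()); break;
                case DOUBLE: out.writeDouble(value.doubleValue()); break;
                default: throw new IOException("unknown tag " + tag);
            }
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the tag byte first and uses it to decide how to read the payload.
    static TaggedValue fromBytes(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            byte tag = in.readByte();
            switch (tag) {
                case INT:    return new TaggedValue(tag, in.readInt());
                case LONG:   return new TaggedValue(tag, in.readLong());
                case DOUBLE: return new TaggedValue(tag, in.readDouble());
                default: throw new IOException("unknown tag " + tag);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch a wrapped int takes 5 bytes (1 tag byte plus 4 payload bytes), and the reader can dispatch on the tag to handle each variety differently.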

Thanks,
Stu


-----Original Message-----
From: novice user <pallavip.05@gmail.com>
Sent: Wednesday, May 14, 2008 11:08pm
To: core-user@hadoop.apache.org
Subject: RE: Performance difference over two map-reduce solutions of same problem in different cluster sizes


Thanks Runping.
But if that is the case, why did it take less time when I ran on a cluster
of size=1? It should have been the same irrespective of whether I am running
on a cluster of size=1 or more. Right?

Thanks


Runping Qi wrote:
> 
> 
> Your diagnosis sounds reasonable.
> Since the mappers of your optimized solution output 3 key/value pairs
> for each input key/value pair, the map output size may be three times
> the input size for each mapper. That size may exceed the value of
> io.sort.mb in your configuration. If so, the mappers have to spill the
> map output onto disk in multiple segments and merge them at the end.
> That is very costly.
> 
> Runping
> 
> 
>> -----Original Message-----
>> From: novice user [mailto:pallavip.05@gmail.com]
>> Sent: Wednesday, May 14, 2008 2:45 AM
>> To: core-user@hadoop.apache.org
>> Subject: Performance difference over two map-reduce solutions of same
>> problem in different cluster sizes
>> 
>> 
>> Hi,
>> I have been working on a problem where I have to process some data and
>> produce three varieties of data, then process each variety and store it
>> in separate files.
>> To solve this problem, I came up with two solutions: one I call
>> un-optimized, the other optimized.
>> 
>> The un-optimized flow goes like this:
>>
>> I created three map-reduce jobs.
>> 1. In the first map-reduce job, the map phase processes the data, passes
>> one variety of it to the reduce phase, and writes the remaining two
>> varieties to different files explicitly (explicitly means using
>> SequenceFile.Writer). The reduce phase of this job processes the first
>> variety of data passed to it, and the processed data is stored by the
>> reduce task.
>> 2. The second map-reduce job takes one of the varieties written to files
>> in the map phase above as input. Its map phase processes this data and
>> sends it to the reduce phase, where the rest of the processing is done
>> and the output is stored in files.
>> 3. The third map-reduce job is similar to the second but works on the
>> other set of data.
>> 
>> The optimized flow goes like this:
>> 1. During the map phase, the data is processed and all three varieties
>> of data are output to the reduce phase.
>> 2. In the reduce phase, the data is processed according to its type; one
>> type is output by the reduce phase and the remaining two types are
>> written to files explicitly.
>> 
>> I assumed that by collapsing the three map-reduce jobs into a single
>> map-reduce job, the process would finish faster than the un-optimized
>> one.
>> 
>> But what I found is that the optimized one runs faster than the
>> un-optimized one when I run on a cluster of size=1. If I run the same on
>> a cluster of size=9, the un-optimized one takes less time (1 minute
>> difference) than the optimized one.
>> 
>> After looking at the analysis of each job, I figured out that the map
>> phase in PLSI is taking too much time because it outputs three values at
>> a time for each key/value pair.
>> 
>> Can anyone please let me know if you see anything suspicious in the
>> above flow that could have caused this?
>> 
>> Thanks
>> 
>> --
>> View this message in context: http://www.nabble.com/Performance-difference-over-two-map-reduce-solutions-of-same-problem-in-different-cluster-sizes-tp17227312p17227312.html
>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> 
> 
> 
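
Runping's point about io.sort.mb can be put in rough numbers. The sketch below is a deliberately simplified model, not Hadoop's actual accounting: it assumes a mapper spills one segment each time the sort buffer fills completely. The 64 MB input per mapper and the 100 MB buffer are assumed values for illustration only; the thread does not state the real ones. The class and method names are hypothetical.

```java
// Simplified model of map-side spilling; NOT Hadoop's real bookkeeping.
class SpillEstimate {
    // Estimated number of on-disk segments, assuming one spill each time the
    // sort buffer fills completely (ceiling of output size / buffer size).
    static long estimateSpills(long mapOutputBytes, long sortBufferBytes) {
        if (sortBufferBytes <= 0) {
            throw new IllegalArgumentException("sortBufferBytes must be positive");
        }
        if (mapOutputBytes <= 0) {
            return 0;
        }
        return (mapOutputBytes + sortBufferBytes - 1) / sortBufferBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long inputPerMapper = 64 * mb; // assumed input size per mapper
        long sortBuffer = 100 * mb;    // assumed io.sort.mb setting

        // One output pair per input pair: output is about the input size.
        System.out.println("1x output: " + estimateSpills(inputPerMapper, sortBuffer) + " segment(s)");
        // Three output pairs per input pair: output is about 3x the input size.
        System.out.println("3x output: " + estimateSpills(3 * inputPerMapper, sortBuffer) + " segment(s)");
    }
}
```

Under these assumptions the 1x case fits in the buffer and produces a single segment, while the 3x case needs more than one segment and therefore an extra merge pass at the end of the map task. The real spill count depends on how the framework manages the buffer, so treat this only as an order-of-magnitude check.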

-- 
View this message in context: http://www.nabble.com/Performance-difference-over-two-map-reduce-solutions-of-same-problem-in-different-cluster-sizes-tp17227312p17245197.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



