hadoop-common-user mailing list archives

From Arun C Murthy <ar...@yahoo-inc.com>
Subject Re: hadoop benchmarked, too slow to use
Date Wed, 11 Jun 2008 20:08:46 GMT

On Jun 11, 2008, at 11:53 AM, Elia Mazzawi wrote:

>
> We concatenated the files to bring them close to, but under, 64MB,
> and the difference was huge without changing anything else:
> we went from 214 minutes to 3 minutes!
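>
> For reference, a merge like that can be done straight in HDFS with
> getmerge, roughly like this (the directory names are only placeholders,
> and the merged copy needs enough local disk space):
>
> # pull all the small files under data/ into one local file, then
> # push the merged file back into HDFS as a single input
> bin/hadoop dfs -getmerge data /tmp/data-merged.tsv
> bin/hadoop dfs -mkdir data-merged
> bin/hadoop dfs -put /tmp/data-merged.tsv data-merged
>
> A single big file still gets split into 64MB blocks, so the job ends
> up running one map per block instead of one map per tiny file.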
>

*smile*

How many reduces are you running now? 1 or more?
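
If it is still just 1, it is worth bumping it up; for example, something
like the following (14 is only a guess at two reduces per slave, and the
-D form only applies if the example picks up the generic options, so
treat this as a sketch):

# rerun the grep example asking for 14 reduces instead of the default
# (use a fresh output directory if the old one still exists)
bin/hadoop jar hadoop-0.17.0-examples.jar grep \
    -D mapred.reduce.tasks=14 data/* output '^[a-zA-Z]+\t'

The same thing can be set per job in code with JobConf.setNumReduceTasks().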

Arun

> Elia Mazzawi wrote:
>> Thanks for the suggestions,
>>
>> I'm going to rerun the same test with files just under 64MB, first
>> with 7 and then with 14 reducers.
>>
>>
>> We've done another test to see whether more servers would speed up the
>> cluster.
>>
>> With 2 nodes down, the 10X data set took 322 minutes (5.3 hours)
>> vs 214 minutes with all nodes online.
>> We started the test after HDFS had marked the nodes as dead, and there
>> were no timeouts.
>>
>> 322/214 is about 50% more time with 5/7 (about 71%) of the servers.
>>
>> So our conclusion is that more servers will make the cluster faster.
>>
>>
>>
>> Ashish Thusoo wrote:
>>> First try just reducing the number of files and increasing the data
>>> in each file so you have close to 64MB of data per file. In your case
>>> that would amount to about 700-800 files in the 10X test case (instead
>>> of the 35000 that you have). See if that gives substantially better
>>> results on your larger test case. For the smaller one, I don't think
>>> you will be able to do better than the unix command - the data set is
>>> too small.
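>>>
>>> One way to do that bundling is on the local copies before uploading;
>>> a rough sketch (the file names and layout here are made up):
>>>
>>> # pack whole lines into ~64MB pieces (split -C never breaks a line),
>>> # use a 3-character suffix so there is room for ~800 output files,
>>> # then push the bundles into HDFS in one go
>>> mkdir -p bundled
>>> cat input/*.tsv | split -C 64M -a 3 - bundled/part-
>>> bin/hadoop dfs -put bundled data-bundled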
>>>
>>> Ashish
>>> -----Original Message-----
>>> From: Elia Mazzawi [mailto:elia.mazzawi@casalemedia.com]
>>> Sent: Tuesday, June 10, 2008 5:00 PM
>>> To: core-user@hadoop.apache.org
>>> Subject: Re: hadoop benchmarked, too slow to use
>>>
>>> So would it make sense for me to configure hadoop for smaller chunks?
>>>
>>> Elia Mazzawi wrote:
>>>
>>>> Yes, chunk size was 64MB, and each file has some data. It used 7
>>>> mappers and 1 reducer.
>>>>
>>>> 10X the data took 214 minutes
>>>> vs 26 minutes for the smaller set
>>>>
>>>> I uploaded the same data 10 times into different directories (so
>>>> more files, same file sizes).
>>>>
>>>>
>>>> Ashish Thusoo wrote:
>>>>
>>>>> Apart from the setup times, the fact that you have 3500 files means
>>>>> that you are going after around 220GB of data, as each file would
>>>>> occupy at least one chunk (this calculation assumes a chunk size of
>>>>> 64MB and that each file has at least some data). Mappers would
>>>>> probably need to read up to this amount of data, and with 7 nodes
>>>>> you may have just 14 map slots. I may be wrong here, but just out of
>>>>> curiosity, how many mappers does your job use?
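>>>>>
>>>>> Back of the envelope (two map slots per node is just the usual
>>>>> default, so treat these numbers as rough):
>>>>>
>>>>>   3500 files x 64MB/chunk   ~= 220GB upper bound on map input
>>>>>   7 nodes x 2 map slots      = 14 maps running at any one time
>>>>>   3500 map tasks / 14 slots ~= 250 waves of map tasks, each one
>>>>>                                paying task startup overhead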
>>>>>
>>>>> I don't know why the 10X data was not better, though, if the bad
>>>>> performance of the smaller test case was due to fragmentation. For
>>>>> that test did you also increase the number of files, or did you
>>>>> simply increase the amount of data in each file?
>>>>>
>>>>> Plus, on small data sets (on the order of 2-3 GB), unix commands
>>>>> can't really be beaten :)
>>>>>
>>>>> Ashish
>>>>> -----Original Message-----
>>>>> From: Elia Mazzawi [mailto:elia.mazzawi@casalemedia.com]
>>>>> Sent: Tuesday, June 10, 2008 3:56 PM
>>>>> To: core-user@hadoop.apache.org
>>>>> Subject: hadoop benchmarked, too slow to use
>>>>>
>>>>> Hello,
>>>>>
>>>>> We were considering using hadoop to process some data; we have it
>>>>> set up on 8 nodes (1 master + 7 slaves).
>>>>>
>>>>> We filled the cluster up with files that contain tab-delimited data
>>>>> (string \tab string, etc.), then ran the grep example with a regular
>>>>> expression to count the number of each unique starting string.
>>>>> We had 3500 files containing 3,015,294 lines, totaling 5 GB.
>>>>>
>>>>> To benchmark it we ran
>>>>> bin/hadoop jar hadoop-0.17.0-examples.jar grep data/* output '^[a-zA-Z]+\t'
>>>>> and it took 26 minutes.
>>>>>
>>>>> Then, to compare, we ran this bash command on one of the nodes,
>>>>> which produced the same output from the data:
>>>>>
>>>>> cat * | sed -e s/\  .*// | sort | uniq -c > /tmp/out
>>>>> (the sed regexp is a tab, not spaces)
>>>>>
>>>>> which took 2.5 minutes
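>>>>>
>>>>> (An equivalent that avoids embedding a literal tab, since cut's
>>>>> default delimiter is already a tab, would be:
>>>>> cut -f1 * | sort | uniq -c > /tmp/out)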
>>>>>
>>>>> Then we added 10X the data into the cluster and reran Hadoop; it
>>>>> took 214 minutes, which is less than 10X the time, but still not
>>>>> that impressive.
>>>>>
>>>>> So we are seeing a 10X performance penalty for using Hadoop vs the
>>>>> system commands - is that expected?
>>>>> We were expecting hadoop to be faster since it is distributed.
>>>>> Perhaps there is too much overhead involved here?
>>>>> Is the data too small?
>>>>>
>>>
>>>
>>
>

