hadoop-common-user mailing list archives

From Elia Mazzawi <elia.mazz...@casalemedia.com>
Subject Re: hadoop benchmarked, too slow to use
Date Wed, 11 Jun 2008 18:53:49 GMT

We concatenated the files to bring them close to, but under, 64 MB, and
without changing anything else the difference was huge:
we went from 214 minutes to 3 minutes!
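
In case it helps anyone, the idea is just to glue the small files together
before loading them into HDFS. A rough sketch of the idea, not the exact
commands we ran (the local paths are made up, and it assumes GNU split,
whose -C option breaks output on line boundaries so records stay whole):

# merge the small input files into ~60MB pieces, keeping lines intact
# (a little under the 64MB block size, so each piece fits in one block)
cat /local/data/part-* | split -C 60m - /local/merged/chunk_

# load the merged pieces into HDFS and point the job at that directory
bin/hadoop dfs -mkdir data-merged
bin/hadoop dfs -put /local/merged/chunk_* data-merged/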

Elia Mazzawi wrote:
> Thanks for the suggestions,
>
> I'm going to rerun the same test with files close to (but under) 64 MB,
> and with 7 and then 14 reducers.
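>
> (For the reducer count I'm planning to pass it as a job property, something
> like the command below. This assumes the grep example in 0.17 picks up the
> generic -D options; if it doesn't, mapred.reduce.tasks can be set in
> conf/hadoop-site.xml instead.)
>
> bin/hadoop jar hadoop-0.17.0-examples.jar grep -D mapred.reduce.tasks=14 \
>     data/* output '^[a-zA-Z]+\t'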
>
>
> we've done another test to see if more servers would speed up the 
> cluster,
>
> With 2 nodes down it took 322 minutes on the 10X data (about 5.4 hours)
> vs 214 minutes with all nodes online.
> We started the test after HDFS had marked the nodes as dead, and there
> were no timeouts.
>
> 322/214 = about 1.5, so roughly 50% more time with 5/7 (about 71%) of the
> servers.
>
> So our conclusion is that more servers will make the cluster faster.
>
>
>
> Ashish Thusoo wrote:
>> Try first just reducing the number of files and increasing the amount of
>> data in each file so that you have close to 64MB of data per file. In
>> your case that would amount to about 700-800 files in the 10X test case
>> (instead of the 35000 that you have). See if that gives substantially
>> better results on your larger test case. For the smaller one, I don't
>> think you will be able to do better than the unix command - the data set
>> is too small.
>>
>> Ashish
>> -----Original Message-----
>> From: Elia Mazzawi [mailto:elia.mazzawi@casalemedia.com]
>> Sent: Tuesday, June 10, 2008 5:00 PM
>> To: core-user@hadoop.apache.org
>> Subject: Re: hadoop benchmarked, too slow to use
>>
>> So would it make sense for me to configure Hadoop for smaller chunks?
>>
>> Elia Mazzawi wrote:
>>  
>>> Yes, chunk size was 64 MB, and each file has some data; it used
>>> 7 mappers and 1 reducer.
>>>
>>> 10X the data took 214 minutes
>>> vs 26 minutes for the smaller set
>>>
>>> I uploaded the same data 10 times into different directories (so more
>>> files, same size each).
>>>
>>>
>>> Ashish Thusoo wrote:
>>>    
>>>> Apart from the setup times, the fact that you have 3500 files means
>>>> that you are going after around 220GB of data, as each file would have
>>>> at least one chunk (this calculation assumes a chunk size of 64MB and
>>>> that each file has at least some data; 3500 x 64MB is roughly 220GB).
>>>> Mappers would probably need to read that amount of data, and with 7
>>>> nodes you may just have 14 map slots. I may be wrong here, but just out
>>>> of curiosity, how many mappers does your job use?
>>>>
>>>> I don't know why the 10X data was not better, though, if the bad
>>>> performance of the smaller test case was due to fragmentation. For that
>>>> test did you also increase the number of files, or did you simply
>>>> increase the amount of data in each file?
>>>>
>>>> Plus, on small data sets (on the order of 2-3 GB), unix commands can't
>>>> really be beaten :)
>>>>
>>>> Ashish
>>>> -----Original Message-----
>>>> From: Elia Mazzawi [mailto:elia.mazzawi@casalemedia.com]
>>>> Sent: Tuesday, June 10, 2008 3:56 PM
>>>> To: core-user@hadoop.apache.org
>>>> Subject: hadoop benchmarked, too slow to use
>>>>
>>>> Hello,
>>>>
>>>> We were considering using Hadoop to process some data; we have it set
>>>> up on 8 nodes (1 master + 7 slaves).
>>>>
>>>> We filled the cluster with files that contain tab-delimited data
>>>> (string \tab string, etc.), then ran the example grep job with a
>>>> regular expression to count the number of each unique starting string.
>>>> We had 3500 files containing 3,015,294 lines, totaling 5 GB.
>>>>
>>>> To benchmark it we ran:
>>>> bin/hadoop jar hadoop-0.17.0-examples.jar grep data/* output '^[a-zA-Z]+\t'
>>>> It took 26 minutes.
>>>>
>>>> Then to compare, we ran this bash command on one of the nodes, which
>>>> produced the same output from the data:
>>>>
>>>> cat * | sed -e s/\  .*// | sort | uniq -c > /tmp/out
>>>> (the sed regexp contains a literal tab, not spaces)
>>>>
>>>> which took 2.5 minutes.
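>>>>
>>>> An equivalent that avoids having to type the literal tab, since cut
>>>> splits on tabs by default (same output, just a sketch):
>>>>
>>>> cat * | cut -f1 | sort | uniq -c > /tmp/out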
>>>>
>>>> Then we added 10X the data to the cluster and reran Hadoop; it took
>>>> 214 minutes, which is less than 10X the time but still not that
>>>> impressive.
>>>>
>>>>
>>>> So we are seeing a 10X performance penalty for using Hadoop vs the
>>>> system commands; is that expected?
>>>> We were expecting Hadoop to be faster since it is distributed.
>>>> Perhaps there is too much overhead involved here?
>>>> Is the data too small?
>

