hadoop-common-user mailing list archives

From Elia Mazzawi <elia.mazz...@casalemedia.com>
Subject Re: hadoop benchmarked, too slow to use
Date Tue, 10 Jun 2008 23:45:47 GMT

Okay, I'll try that.
Thanks.

Joydeep Sen Sarma wrote:
> Perhaps something on the order of the number of cores in your system.
> At least 7 for sure - but perhaps 14 if you have multi-core machines
> (assuming at least 2 tasks per node).
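>
> For example, a sketch of what that could look like in
> conf/hadoop-site.xml (mapred.reduce.tasks is the property name of that
> era; the value is just a placeholder to size against your cluster):
>
>   <property>
>     <name>mapred.reduce.tasks</name>
>     <value>14</value>
>   </property>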
>
> Also - Ashish hit the nail on the head - way too many small files.
> Hadoop job overhead is killing you. At this point I wish I could say
> "just use TextMultiFileInputFormat" - except there isn't one (and I
> guess the nearest alternative, the Hadoop archive stuff, is not in
> 0.17). Bad luck.
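>
> One workaround, if the records are line-oriented: concatenate the
> small files into a handful of large chunks before loading them into
> HDFS, so each map task gets a meaningful amount of input. A rough
> sketch, assuming the raw files sit in a local ./data directory and
> that the chunk and directory names are only illustrative:
>
>   cat data/* | split -l 500000 - chunk-
>   bin/hadoop fs -mkdir merged-data
>   bin/hadoop fs -put chunk-* merged-data
>
> Fewer, larger files means fewer map tasks and far less per-task
> overhead.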
>
> -----Original Message-----
> From: Elia Mazzawi [mailto:elia.mazzawi@casalemedia.com] 
> Sent: Tuesday, June 10, 2008 4:26 PM
> To: core-user@hadoop.apache.org
> Subject: Re: hadoop benchmarked, too slow to use
>
>
Yes, there was only 1 reducer. How many should I try?
>
>
> Joydeep Sen Sarma wrote:
>
>> How many reducers? Perhaps you are defaulting to one reducer.
>>
>> One variable is how fast the Java regex evaluation is compared to
>> sed. One option is to use Hadoop streaming and use your sed fragment
>> as the mapper. That will be another way of measuring Hadoop overhead
>> that eliminates some variables.
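>>
>> A sketch of what that streaming run could look like (the jar path,
>> output directory, and reducer count are only illustrative; cut -f1
>> stands in for the sed fragment, since both keep only the text before
>> the first tab):
>>
>>   bin/hadoop jar contrib/streaming/hadoop-0.17.0-streaming.jar \
>>     -input data \
>>     -output output-streaming \
>>     -mapper 'cut -f1' \
>>     -reducer 'uniq -c' \
>>     -jobconf mapred.reduce.tasks=14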
>>
>> Hadoop also has quite a few variables to tune performance (check the
>> Hadoop wiki for Yahoo's sort benchmark settings, for example).
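>>
>> For instance, a couple of the knobs from that era (names as in 0.17;
>> the values here are placeholders, not recommendations):
>>
>>   <property>
>>     <name>mapred.tasktracker.map.tasks.maximum</name>
>>     <value>2</value>
>>   </property>
>>   <property>
>>     <name>io.sort.mb</name>
>>     <value>200</value>
>>   </property>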
>>
>> -----Original Message-----
>> From: Elia Mazzawi [mailto:elia.mazzawi@casalemedia.com] 
>> Sent: Tuesday, June 10, 2008 3:56 PM
>> To: core-user@hadoop.apache.org
>> Subject: hadoop benchmarked, too slow to use
>>
>> Hello,
>>
>> We were considering using Hadoop to process some data. We have it set
>> up on 8 nodes (1 master + 7 slaves).
>>
>> We filled the cluster with files containing tab-delimited data
>> (string \t string, etc.), then ran the grep example with a regular
>> expression to count the number of each unique starting string.
>> We had 3,500 files containing 3,015,294 lines, totaling 5 GB.
>>
>> To benchmark it we ran:
>>
>> bin/hadoop jar hadoop-0.17.0-examples.jar grep data/* output '^[a-zA-Z]+\t'
>>
>> which took 26 minutes.
>>
>> Then, to compare, we ran this bash command on one of the nodes, which
>> produced the same output from the data:
>>
>> cat * | sed -e 's/\t.*//' | sort | uniq -c > /tmp/out
>> (the \t in the sed expression is a literal tab, not spaces)
>>
>> which took 2.5 minutes.
>>
>> Then we added 10x the data to the cluster and reran Hadoop. It took
>> 214 minutes, which is less than 10x the time, but still not that
>> impressive.
>>
>>
>> So we are seeing a 10x performance penalty for using Hadoop versus
>> the system commands. Is that expected? We were expecting Hadoop to be
>> faster since it is distributed. Perhaps there is too much overhead
>> involved here? Is the data too small?

