hadoop-common-user mailing list archives

From "Miles Osborne" <mi...@inf.ed.ac.uk>
Subject Re: hadoop benchmarked, too slow to use
Date Tue, 10 Jun 2008 23:30:38 GMT
Why not do a little experiment and see what the timing results are when
using a range of reducers?

e.g. 1, 2, 5, 7, 13
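One way to run that sweep is a shell loop over the same job (a sketch only: whether the 0.17 Grep example driver honors the `-D` generic option is an assumption — if it does not, `mapred.reduce.tasks` can be set in hadoop-site.xml between runs):

```shell
# Time the same grep job with a range of reducer counts.
# ASSUMPTION: the example driver accepts -D generic options in this
# Hadoop version; otherwise set mapred.reduce.tasks in hadoop-site.xml.
for r in 1 2 5 7 13; do
  echo "=== reducers: $r ==="
  time bin/hadoop jar hadoop-0.17.0-examples.jar grep \
      -D mapred.reduce.tasks=$r data/* output-r$r '^[a-zA-Z]+\t'
done
```

Each run writes to its own output directory (output-r1, output-r2, ...) so the results can be compared afterwards.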

Miles

2008/6/11 Elia Mazzawi <elia.mazzawi@casalemedia.com>:

>
> Yes, there was only 1 reducer. How many should I try?
>
>
>
> Joydeep Sen Sarma wrote:
>
>> How many reducers? Perhaps you are defaulting to one reducer.
>>
>> One variable is how fast the Java regex evaluation is relative to sed. One
>> option is to use Hadoop Streaming with your sed fragment as the mapper.
>> That would be another way of measuring Hadoop overhead that eliminates
>> some variables.
>>
>> Hadoop also has quite a few variables to tune for performance (check
>> the Hadoop wiki for Yahoo!'s sort benchmark settings, for example).
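To make the streaming suggestion concrete, a job along these lines should be close (a sketch: the contrib jar path and the `-jobconf` flag are as I recall them from the 0.17 streaming docs — check your install; `\t` in the sed expression assumes GNU sed):

```shell
# Sketch: replicate the sed pipeline as a Hadoop Streaming job.
# The mapper strips everything after the first tab, leaving one key per
# line; streaming sorts mapper output by key before the reduce, so a
# plain "uniq -c" reducer counts each unique starting string.
bin/hadoop jar contrib/streaming/hadoop-0.17.0-streaming.jar \
    -input data \
    -output streaming-out \
    -mapper "sed -e 's/\t.*//'" \
    -reducer "uniq -c" \
    -jobconf mapred.reduce.tasks=2
```

With more than one reducer, each key still lands in exactly one partition, so the per-key counts stay correct; they are just spread across the part-* output files.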
>>
>> -----Original Message-----
>> From: Elia Mazzawi [mailto:elia.mazzawi@casalemedia.com]
>> Sent: Tuesday, June 10, 2008 3:56 PM
>> To: core-user@hadoop.apache.org
>> Subject: hadoop benchmarked, too slow to use
>>
>> Hello,
>>
>> We were considering using Hadoop to process some data. We have it set up
>> on 8 nodes (1 master + 7 slaves).
>>
>> We filled the cluster with files containing tab-delimited data
>> (string \tab string, etc.), then ran the example grep with a regular
>> expression to count the number of each unique starting string.
>> We had 3500 files containing 3,015,294 lines, totaling 5 GB.
>>
>> To benchmark it, we ran:
>>
>> bin/hadoop jar hadoop-0.17.0-examples.jar grep data/* output '^[a-zA-Z]+\t'
>>
>> which took 26 minutes.
>>
>> Then, to compare, we ran this bash command on one of the nodes, which
>> produced the same output from the data:
>>
>> cat * | sed -e s/\  .*// | sort | uniq -c > /tmp/out
>> (the character in the sed expression is a literal tab, not spaces)
>>
>> which took 2.5 minutes.
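An aside for anyone replaying this thread: the archive mangled the literal tab in the sed expression into spaces. A portable rendering of the same pipeline on toy data (file names and contents are made up for illustration), with the tab captured explicitly:

```shell
# Rebuild the single-node pipeline on a tiny sample so each step is visible.
# The tab is captured via printf, avoiding shells/seds that don't expand \t.
tmpdir=$(mktemp -d)
printf 'foo\t1\nbar\t2\nfoo\t3\n' > "$tmpdir/part-00000"
printf 'baz\t4\nfoo\t5\n' > "$tmpdir/part-00001"
tab=$(printf '\t')
# Strip everything after the first tab, then count each unique leading string.
cat "$tmpdir"/part-* | sed "s/$tab.*//" | sort | uniq -c
```

On this sample it prints a count of 1 for bar, 1 for baz, and 3 for foo — the same per-starting-string counts the Grep example job computes.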
>>
>> Then we added 10X the data into the cluster and reran Hadoop. It took 214
>> minutes, which is less than 10X the time, but still not that impressive.
>>
>>
>> So we are seeing a 10X performance penalty for using Hadoop versus the
>> system commands. Is that expected?
>> We were expecting Hadoop to be faster since it is distributed.
>> Perhaps there is too much overhead involved here?
>> Is the data too small?
>>
>>
>
>


-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.
