hadoop-common-user mailing list archives

From "Joydeep Sen Sarma" <jssa...@facebook.com>
Subject RE: hadoop benchmarked, too slow to use
Date Tue, 10 Jun 2008 23:16:43 GMT
How many reducers are you using? Perhaps you are defaulting to one reducer.

One variable is how fast Java regex evaluation is relative to sed. One
option is to use Hadoop Streaming with your sed fragment as the mapper.
That would be another way of measuring Hadoop's overhead while
eliminating some variables.
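Such a streaming job might look like the sketch below. The jar path and the `-jobconf` property name are assumptions based on the 0.17-era layout; adjust them to your installation. Because streaming sorts mapper output by key before the reduce phase, a simple `uniq -c` can serve as the reducer.

```shell
# Hypothetical Hadoop Streaming invocation (jar path assumed for 0.17).
# The sed fragment strips everything after the first tab, so the key
# (first field) becomes the mapper output; uniq -c counts each key in
# the sorted reducer input.
bin/hadoop jar contrib/streaming/hadoop-0.17.0-streaming.jar \
    -input data \
    -output streamout \
    -mapper "sed -e 's/\t.*//'" \
    -reducer "uniq -c" \
    -jobconf mapred.reduce.tasks=7
```

Note that `\t` in the sed pattern relies on GNU sed; on other seds you would embed a literal tab instead.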

Hadoop also has quite a few variables to tune for performance (check
the Hadoop wiki for Yahoo!'s sort benchmark settings, for example).
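As a starting point, a hadoop-site.xml fragment like the following could set a non-default reducer count and a larger sort buffer. The property names are the 0.17-era ones; the values are illustrative, not recommendations.

```xml
<!-- Hypothetical hadoop-site.xml fragment; tune values for your cluster. -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>7</value>  <!-- roughly one reducer per slave node -->
</property>
<property>
  <name>io.sort.mb</name>
  <value>200</value>  <!-- buffer memory (MB) for sorting map output -->
</property>
```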

-----Original Message-----
From: Elia Mazzawi [mailto:elia.mazzawi@casalemedia.com] 
Sent: Tuesday, June 10, 2008 3:56 PM
To: core-user@hadoop.apache.org
Subject: hadoop benchmarked, too slow to use


We were considering using Hadoop to process some data. We have it set
up on 8 nodes (1 master + 7 slaves).

We filled the cluster with files containing tab-delimited data
(string \tab string, etc.), then ran the grep example with a regular
expression to count the number of each unique starting string.
We had 3,500 files containing 3,015,294 lines, totaling 5 GB.

To benchmark it, we ran
bin/hadoop jar hadoop-0.17.0-examples.jar grep data/*  output
which took 26 minutes.

Then, to compare, we ran this bash command on one of the nodes, which
produced the same output from the data:

cat * | sed -e $'s/\t.*//' | sort | uniq -c > /tmp/out
(the sed pattern matches a tab, not spaces)

which took 2.5 minutes
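For reference, the pipeline's behavior can be seen on a tiny inline sample of tab-delimited lines (the sample data here is made up for illustration):

```shell
# Each input line is "key<TAB>rest"; strip everything after the first
# tab, sort, and count unique keys -- the same pipeline as above.
printf 'a\tx\nb\ty\na\tz\n' | sed -e $'s/\t.*//' | sort | uniq -c
```

This prints one line per unique key with its count, e.g. `2 a` and `1 b` (leading whitespace depends on the uniq implementation).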

Then we added 10X the data to the cluster and reran Hadoop. It took
214 minutes, which is less than 10X the time, but still not that fast.

So we are seeing roughly a 10X performance penalty for using Hadoop vs.
the system commands. Is that expected?
We were expecting Hadoop to be faster since it is distributed.
Perhaps there is too much overhead involved here?
Is the data too small?
