hadoop-common-user mailing list archives

From "Ashish Thusoo" <athu...@facebook.com>
Subject RE: hadoop benchmarked, too slow to use
Date Tue, 10 Jun 2008 23:17:29 GMT
Apart from the setup times, the fact that you have 3500 files means that
you are going after around 220 GB of data, as each file would occupy at
least one chunk (this calculation assumes a chunk size of 64 MB and
that each file has at least some data). Mappers would probably need to
read that amount of data, and with 7 nodes you may have just 14 map
slots. I may be wrong here, but just out of curiosity: how many mappers
does your job use?
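As a rough sketch of that arithmetic (assuming one map task per file,
since each file holds at least one block, and the 0.17 default of 2 map
slots per tasktracker):

# back-of-envelope on the 26-minute run reported below:
#   3500 files -> at least 3500 map tasks
#   7 nodes x 2 map slots = 14 concurrent maps
#   3500 tasks / 14 slots ~= 250 waves of task startup
#   26 min = 1560 s; 1560 s / 250 waves ~= 6 s per wave, so
#   per-task startup overhead could plausibly dominate the runtime
# the actual map-task count shows up on the JobTracker web UI
# (http://<jobtracker>:50030); the CLI can also report job status,
# e.g. (the job id here is a placeholder):
bin/hadoop job -status job_200806101234_0001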

I don't know why the 10X data run was not comparatively better, though,
if the bad performance of the smaller test case was due to
fragmentation. For that test did you also increase the number of files,
or did you simply increase the amount of data in each file?

Plus, on small data sets (on the order of 2-3 GB), unix commands can't
really be beaten :)
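One way to test that theory: if per-file overhead is the culprit,
packing the 3500 small files into a handful of large ones before
loading them into HDFS should cut the map-task count dramatically. A
minimal sketch (the part-0*, merged-00.tsv, and data-merged names are
made up for illustration):

# pack many small local files into a few big ones so each map task
# reads a full block instead of one tiny file
cat part-0* > merged-00.tsv
bin/hadoop dfs -mkdir data-merged
bin/hadoop dfs -put merged-00.tsv data-merged/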


-----Original Message-----
From: Elia Mazzawi [mailto:elia.mazzawi@casalemedia.com] 
Sent: Tuesday, June 10, 2008 3:56 PM
To: core-user@hadoop.apache.org
Subject: hadoop benchmarked, too slow to use


We were considering using Hadoop to process some data. We have it set
up on 8 nodes (1 master + 7 slaves).

We filled the cluster with files containing tab-delimited data
(string \t string, etc.).
Then we ran the grep example with a regular expression to count the
number of each unique starting string.
We had 3500 files containing 3,015,294 lines, totaling 5 GB.

To benchmark it we ran:
bin/hadoop jar hadoop-0.17.0-examples.jar grep data/* output
It took 26 minutes.
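For reference, the grep example in the examples jar takes the regular
expression as a third argument; the pattern below is a placeholder for
illustration, not necessarily the one used above:

# usage: grep <inDir> <outDir> <regex> [<group>]
bin/hadoop jar hadoop-0.17.0-examples.jar grep data grep-out '^[^\t]*'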

Then, to compare, we ran this bash command on one of the nodes, which
produced the same output from the data:

cat * | sed -e $'s/\t.*//' | sort | uniq -c > /tmp/out

(the character matched in the sed expression is a tab, not spaces; the
$'...' quoting lets bash expand the \t into a literal tab)

which took 2.5 minutes.
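An equivalent pipeline using cut, which splits on tabs by default,
sidesteps the literal-tab quoting and produces the same output:

# keep the first tab-delimited field, then count unique values
cut -f1 * | sort | uniq -c > /tmp/out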

Then we added 10X the data into the cluster and reran Hadoop. It took
214 minutes, which is less than 10X the time, but still not that fast.

So we are seeing a roughly 10X performance penalty for using Hadoop vs
the system commands.
Is that expected?
We were expecting Hadoop to be faster since it is distributed.
Perhaps there is too much overhead involved here?
Is the data too small?
