Subject: RE: hadoop benchmarked, too slow to use
From: "Ashish Thusoo" <athusoo@facebook.com>
To: core-user@hadoop.apache.org
Date: Wed, 11 Jun 2008 13:50:53 -0700

good to know... this puppy does scale :)

and hadoop is awesome for what it does...

Ashish

-----Original Message-----
From: Elia Mazzawi [mailto:elia.mazzawi@casalemedia.com]
Sent: Wednesday, June 11, 2008 11:54 AM
To: core-user@hadoop.apache.org
Subject: Re: hadoop benchmarked, too slow to use

we concatenated the files to bring them close to, but less than, 64 MB
and the difference was huge: without changing anything else we went
from 214 minutes to 3 minutes!

Elia Mazzawi wrote:
> Thanks for the suggestions,
>
> I'm going to rerun the same test with files close to (but under) 64 MB
> and 7, then 14, reducers.
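
The thread does not include the script used for the concatenation step, but a minimal sketch of that idea might look like the following: it packs many small local files into parts of roughly 64 MB before uploading them with the dfs shell, so each part fills about one HDFS block. The directory names, the GNU stat call, and the 64 MB threshold are assumptions for illustration, not details taken from this thread.

    #!/bin/bash
    # hypothetical helper: pack many small files into ~64 MB parts, then upload.
    SRC=./small-files            # local directory holding the small input files
    OUT=./merged                 # local staging directory for the merged parts
    LIMIT=$((64 * 1024 * 1024))  # target part size: ~64 MB (one HDFS block here)

    mkdir -p "$OUT"
    part=0
    size=0
    for f in "$SRC"/*; do
        cat "$f" >> "$OUT/part-$part"
        size=$((size + $(stat -c %s "$f")))
        if [ "$size" -ge "$LIMIT" ]; then   # start a new part once ~64 MB is reached
            part=$((part + 1))
            size=0
        fi
    done

    # upload a few hundred large parts instead of tens of thousands of small files
    bin/hadoop dfs -put "$OUT" data

Feeding the job a few hundred ~64 MB parts instead of tens of thousands of small files is exactly the change that took the 10X run from 214 minutes down to 3.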
>
> we've done another test to see if more servers would speed up the
> cluster.
>
> with 2 nodes down it took 322 minutes on the 10X data (about 5.4 hours)
> vs 214 minutes with all nodes online.
> we started the test after hdfs marked the nodes as dead, and there were
> no timeouts.
>
> 322/214 = about 50% more time with 5/7 = 71% of the servers.
>
> so our conclusion is that more servers will make the cluster faster.
>
> Ashish Thusoo wrote:
>> Try first just reducing the number of files and increasing the data in
>> each file so you have close to 64MB of data per file. In your case
>> that would amount to about 700-800 files in the 10X test case (instead
>> of the 35000 that you have). See if that gives substantially better
>> results on your larger test case. For the smaller one, I don't think
>> you will be able to do better than the unix command - the data set is
>> too small.
>>
>> Ashish
>> -----Original Message-----
>> From: Elia Mazzawi [mailto:elia.mazzawi@casalemedia.com]
>> Sent: Tuesday, June 10, 2008 5:00 PM
>> To: core-user@hadoop.apache.org
>> Subject: Re: hadoop benchmarked, too slow to use
>>
>> so it would make sense for me to configure hadoop for smaller chunks?
>>
>> Elia Mazzawi wrote:
>>> yes, chunk size was 64MB and each file has some data. it used 7
>>> mappers and 1 reducer.
>>>
>>> 10X the data took 214 minutes
>>> vs 26 minutes for the smaller set
>>>
>>> i uploaded the same data 10 times in different directories
>>> (so more files, same size)
>>>
>>> Ashish Thusoo wrote:
>>>> Apart from the setup times, the fact that you have 3500 files means
>>>> that you are going after around 220GB of data, as each file would
>>>> have at least one chunk (this calculation assumes a chunk size of
>>>> 64MB and that each file has at least some data). Mappers would
>>>> probably need to read up to this amount of data, and with 7 nodes
>>>> you may just have 14 map slots. I may be wrong here, but just out of
>>>> curiosity, how many mappers does your job use?
>>>>
>>>> Don't know why the 10X data was not better though, if the bad
>>>> performance of the smaller test case was due to fragmentation. For
>>>> that test did you also increase the number of files, or did you
>>>> simply increase the amount of data in each file?
>>>>
>>>> Plus, on small sets of data (of the order of 2-3 GB) unix commands
>>>> can't really be beaten :)
>>>>
>>>> Ashish
>>>> -----Original Message-----
>>>> From: Elia Mazzawi [mailto:elia.mazzawi@casalemedia.com]
>>>> Sent: Tuesday, June 10, 2008 3:56 PM
>>>> To: core-user@hadoop.apache.org
>>>> Subject: hadoop benchmarked, too slow to use
>>>>
>>>> Hello,
>>>>
>>>> we were considering using hadoop to process some data. we have it
>>>> set up on 8 nodes (1 master + 7 slaves).
>>>>
>>>> we filled the cluster up with files that contain tab delimited data:
>>>> string \tab string etc.
>>>> then we ran the example grep with a regular expression to count the
>>>> number of each unique starting string.
>>>> we had 3500 files containing 3,015,294 lines totaling 5 GB.
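
As a sanity check on the "aim for ~64 MB per file" advice above, something along these lines (the dfs shell options and paths are assumptions; exact commands and output vary by Hadoop version) shows the input file count and total size, plus the arithmetic behind the 700-800 file estimate for the 10X case:

    # approximate file count and total size of the input directory in HDFS
    bin/hadoop dfs -ls data | wc -l    # roughly the number of input files
    bin/hadoop dfs -dus data           # aggregate bytes under the directory, if -dus is available

    # the arithmetic behind the 700-800 file estimate for the 10X (~50 GB) case:
    echo $(( 50 * 1024 / 64 ))         # 800 parts of 64 MB cover ~50 GB
    echo $(( 5 * 1024 / 64 ))          # and ~80 parts would cover the 5 GB set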
>>>>
>>>> to benchmark it we ran
>>>> bin/hadoop jar hadoop-0.17.0-examples.jar grep data/* output '^[a-zA-Z]+\t'
>>>> it took 26 minutes.
>>>>
>>>> then to compare, we ran this bash command on one of the nodes, which
>>>> produced the same output from the data:
>>>>
>>>> cat * | sed -e s/\ .*// | sort | uniq -c > /tmp/out
>>>> (the sed regexp is a tab, not spaces)
>>>>
>>>> which took 2.5 minutes.
>>>>
>>>> then we added 10X the data into the cluster and reran Hadoop. it
>>>> took 214 minutes, which is less than 10X the time, but still not
>>>> that impressive.
>>>>
>>>> so we are seeing a 10X performance penalty for using Hadoop vs the
>>>> system commands. is that expected?
>>>> we were expecting hadoop to be faster since it is distributed.
>>>> perhaps there is too much overhead involved here?
>>>> is the data too small?
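
Since the tab in the sed expression is easy to lose in email, here is one way the same local comparison could be written with the tab made explicit. This is a sketch of an equivalent pipeline (bash $'...' quoting assumed), not the exact command that was timed:

    # count occurrences of each unique first (tab-delimited) field, locally
    cat * | sed -e $'s/\t.*//' | sort | uniq -c > /tmp/out

    # an equivalent formulation with awk and an explicit tab field separator
    cat * | awk -F'\t' '{ print $1 }' | sort | uniq -c > /tmp/out

Either form strips everything after the first tab and counts the distinct leading strings, which is what the grep example job was being compared against.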