From: Ted Dunning
Reply-To: hadoop-user@lucene.apache.org
Date: Sun, 04 Nov 2007 16:40:21 -0700
Subject: Re: Very weak mapred performance on small clusters with a massive amount of small files
In-Reply-To: <472E4F88.2040001@andremartin.de>

If your larger run is typical of your smaller run, then you have lots and lots of small files. This is going to make things slow even without the overhead of a distributed computation. In the sequential case, enumerating the files and the inefficient read patterns will be what slow you down. The inefficient reads come about because the disk has to seek every 100KB of input. That is bad. In the Hadoop case, things are worse because opening a file takes much longer than with local files.

The solution is for you to package your data more efficiently. This fixes a multitude of ills. If you don't mind limiting your available parallelism a little bit, you could even use tar files (tar isn't usually recommended because you can't split a tar file across maps). If you were to package 1000 files per bundle, you would get average file sizes of 100MB instead of 100KB, and your file-opening overhead in the parallel case would be decreased by 1000x. Your disk read speed would be much higher as well, because your disks would mostly be reading contiguous sectors.

I have a system similar to yours with lots and lots of little files (littler than yours, even). With aggressive file bundling I can routinely process data at a sustained rate of 100MB/s on ten really crummy storage/compute nodes. Moreover, that rate is probably not even bounded by I/O, since my data takes a fair bit of CPU to decrypt and parse.
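
A minimal sketch of one way to do that kind of bundling, using a SequenceFile keyed by the original file name with the raw bytes as the value (the class name, paths, and error handling here are illustrative assumptions, not anyone's actual setup):

    // Sketch only: bundle many small local files into one Hadoop SequenceFile
    // so that maps read a few large files instead of thousands of tiny ones.
    import java.io.DataInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallFileBundler {
      public static void main(String[] args) throws IOException {
        File inputDir = new File(args[0]);   // directory full of small files
        Path bundle = new Path(args[1]);     // destination bundle, e.g. on HDFS

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Key = original file name, value = raw file contents.
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, bundle, Text.class, BytesWritable.class);
        try {
          for (File f : inputDir.listFiles()) {
            byte[] buf = new byte[(int) f.length()];
            DataInputStream in = new DataInputStream(new FileInputStream(f));
            try {
              in.readFully(buf);
            } finally {
              in.close();
            }
            writer.append(new Text(f.getName()), new BytesWritable(buf));
          }
        } finally {
          writer.close();
        }
      }
    }

A map can then iterate over the (name, bytes) records of each bundle instead of opening one HDFS file per input record.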
On 11/4/07 4:02 PM, "André Martin" wrote:

> Hi Enis & Hadoopers,
> thanks for the hint. I created/modified my RecordReader so that it uses
> MultiFileInputSplit and reads 30 files at once (by spawning several
> threads and using a bounded buffer à la producer/consumer). The
> accumulated throughput is now about 1MB/s on my 30 MB test data (spread
> over 300 files).
> However, I noticed some other bottlenecks during job submission: a job
> submission of 53,000 files spread over 18,150 folders takes about 1 hr
> and 45 min.
> Since all the files are spread over several thousand directories,
> listing/traversing those directories using the listpath/globpaths
> methods generates several thousand RPC calls. I think it would be more
> efficient to send the regex/path expression (the parameters) of the
> globpaths method to the server and traverse the directory tree on the
> server side instead of the client side. Or is there another way to
> retrieve all the file paths?
> Also, for each of my thousands of files, a getBlockLocation RPC call
> was generated. I implemented/added a getBlockLocations method that
> accepts an array of paths etc. and returns a String[][][] matrix, which
> is much more efficient than generating thousands of RPC calls by
> calling getBlockLocation in the MultiFileSplit class...
> Any thoughts/comments are much appreciated!
> Thanks in advance!
>
> Cu on the 'net,
>                  Bye - bye,
>
>                             <<<<< André <<<< >>>> èrbnA >>>>>
>
> Enis Soztutar wrote:
>> Hi,
>>
>> I think you should try using MultiFileInputFormat/MultiFileInputSplit
>> rather than FileSplit, since the former is optimized for processing a
>> large number of files. Could you report your numMaps and numReduces
>> and the average time the map() function is expected to take?
>
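
For what it's worth, the bounded-buffer producer/consumer reader André describes can be sketched with plain java.util.concurrent; the thread count, queue capacity, and whole-file byte[] records below are illustrative assumptions, not his actual code:

    // Sketch: a pool of reader threads opens several HDFS files at once and
    // hands whole-file payloads to a single consumer through a blocking queue.
    import java.io.IOException;
    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ParallelFileReader {
      public static void readAll(List<Path> paths, final FileSystem fs)
          throws IOException, InterruptedException {
        final BlockingQueue<byte[]> queue = new ArrayBlockingQueue<byte[]>(100);
        ExecutorService pool = Executors.newFixedThreadPool(30); // ~30 files at once

        for (final Path p : paths) {
          pool.submit(new Runnable() {
            public void run() {
              try {
                byte[] buf = new byte[(int) fs.getFileStatus(p).getLen()];
                FSDataInputStream in = fs.open(p);
                try {
                  in.readFully(buf);
                } finally {
                  in.close();
                }
                queue.put(buf); // blocks if the consumer falls behind
              } catch (Exception e) {
                e.printStackTrace(); // real code needs proper error handling
              }
            }
          });
        }
        pool.shutdown();

        // Single consumer: one record per file, e.g. the RecordReader's next().
        for (int i = 0; i < paths.size(); i++) {
          byte[] record = queue.take();
          // ... hand 'record' to the map() logic here ...
        }
      }
    }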