hadoop-hdfs-user mailing list archives

From Jayaseelan E <jayaseela...@ericsson.com>
Subject FW: Storing millions of small files
Date Wed, 23 May 2012 09:29:58 GMT
 

-----Original Message-----
From: Keith Wiley [mailto:kwiley@keithwiley.com] 
Sent: Tuesday, May 22, 2012 9:57 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: Storing millions of small files

In addition to the responses already provided, there is another downside to using Hadoop with
numerous files: it takes much longer to run a Hadoop job!  Starting a Hadoop job involves
communication between the driver (which runs on a client machine outside the cluster) and
the namenode to locate all of the input files.  Each individual file is located with a set
of RPCs between the client and the cluster, and this is done in an entirely serial fashion.
In experiments we ran (and gave a talk on at the Hadoop Summit in 2010), we concluded that
this overhead dominated our Hadoop jobs.  By reducing the number of files (by packing them
into sequence files), we could greatly decrease the overall job time even though the actual
MapReduce time was unaffected, simply because there were far fewer files to locate.
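
As a rough illustration of the idea above (not the code from the talk, and not the real
Hadoop SequenceFile format), here is a minimal Python sketch that packs many small files
into one container of key/value records and models the serial lookup cost. The record
layout and the names pack_files, read_records, and serial_lookup_seconds are hypothetical,
chosen only for this example.

```python
import struct
from pathlib import Path
from typing import Iterator, Tuple

# Hypothetical record layout (NOT the Hadoop SequenceFile format):
# [4-byte key length][4-byte value length][key bytes][value bytes]
_HEADER = struct.Struct(">II")


def pack_files(src_dir: Path, container: Path) -> int:
    """Write every regular file in src_dir into one container file.

    The file name is the key and the file contents are the value.
    The container should live outside src_dir so it is not packed
    into itself. Returns the number of records written.
    """
    count = 0
    with container.open("wb") as out:
        for path in sorted(src_dir.iterdir()):
            if not path.is_file():
                continue
            key = path.name.encode("utf-8")
            value = path.read_bytes()
            out.write(_HEADER.pack(len(key), len(value)))
            out.write(key)
            out.write(value)
            count += 1
    return count


def read_records(container: Path) -> Iterator[Tuple[str, bytes]]:
    """Yield (file name, contents) pairs back out of the container."""
    with container.open("rb") as inp:
        while True:
            header = inp.read(_HEADER.size)
            if not header:
                break  # clean end of container
            key_len, value_len = _HEADER.unpack(header)
            key = inp.read(key_len).decode("utf-8")
            value = inp.read(value_len)
            yield key, value


def serial_lookup_seconds(num_files: int, rpcs_per_file: int,
                          rpc_seconds: float) -> float:
    """Simple model of job startup: every file costs a fixed number of
    RPCs, issued one after another. All three inputs are assumptions."""
    return num_files * rpcs_per_file * rpc_seconds
```

With made-up numbers, serial_lookup_seconds(100_000, 2, 0.005) models 100,000 input files
at two 5 ms RPCs each, roughly 1,000 seconds spent before any map task starts. Packing the
same data into a handful of containers shrinks num_files, and with it the startup term,
while the per-record work stays the same.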

Here's a link to the slides from my talk:
http://www.slideshare.net/ydn/8-image-stackinghadoopsummit2010

Cheers!

On May 22, 2012, at 02:39 , Brendan cheng wrote:

> 
> Hi,
> I read the HDFS architecture doc, and it says HDFS is tuned for storing large files, typically
> gigabytes to terabytes. What is the downside of storing millions of small files, e.g. <10MB?
> Or what HDFS settings are suitable for storing small files?
> Actually, I plan to find a distributed file system for storing multiple millions of files.
> Brendan


________________________________________________________________________________
Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can itch but
a scratch can't scratch. Finally, a scratch can itch, but an itch can't scratch. All together
this implies: He scratched the itch from the scratch that itched but would never itch the
scratch from the itch that scratched."
                                           --  Keith Wiley
________________________________________________________________________________

