hadoop-hdfs-user mailing list archives

From Keith Wiley <kwi...@keithwiley.com>
Subject Re: Storing millions of small files
Date Tue, 22 May 2012 16:27:04 GMT
In addition to the responses already provided, there is another downside to using Hadoop with
numerous files: it takes much longer to run a Hadoop job!  Starting a job consists of
communication between the driver (which runs on a client machine outside the cluster) and
the namenode to locate all of the input files.  Each and every individual file is located
with a set of RPCs between the client and the cluster, and this is done in an entirely serial
fashion.  In experiments we ran (and gave a talk on at the Hadoop Summit in 2010) we concluded
that this overhead dominated our Hadoop jobs.  By reducing the number of files (by packing them
into sequence files) we could greatly decrease the overall job time, simply by cutting the
overhead of locating all of the files, even though the actual MapReduce time was unaffected.
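
For anyone curious what the packing step looks like, it's just a loop that reads each small
file and appends it to a SequenceFile keyed by filename.  Here's a minimal sketch against the
Hadoop 1.x Java API (the class name and argument handling are mine for illustration, not from
our actual code):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Packs every file in a directory into one SequenceFile, keyed by
    // filename, so a job resolves one file with the namenode instead of millions.
    public class SmallFilePacker {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path inputDir = new Path(args[0]);  // directory full of small files
            Path packed = new Path(args[1]);    // output SequenceFile

            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, packed, Text.class, BytesWritable.class);
            try {
                for (FileStatus status : fs.listStatus(inputDir)) {
                    if (status.isDir()) continue;  // skip subdirectories
                    // files are small (<10MB in this thread), so reading
                    // each one fully into memory is fine
                    byte[] contents = new byte[(int) status.getLen()];
                    FSDataInputStream in = fs.open(status.getPath());
                    try {
                        in.readFully(contents);
                    } finally {
                        in.close();
                    }
                    // key = original filename, value = the raw bytes
                    writer.append(new Text(status.getPath().getName()),
                                  new BytesWritable(contents));
                }
            } finally {
                IOUtils.closeStream(writer);
            }
        }
    }

On the read side, SequenceFileInputFormat hands each (filename, bytes) pair to a mapper, so
the per-file RPC overhead at job startup disappears.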

Here's a link to the slides from my talk:


On May 22, 2012, at 02:39 , Brendan cheng wrote:

> Hi,
> I read the HDFS architecture doc and it said HDFS is tuned for storing large files, typically
> gigabytes to terabytes. What is the downside of storing millions of small files, e.g. <10MB?
> Or what HDFS settings are suitable for storing small files?
> Actually, I plan to find a distributed file system for storing multiple millions of files.
> Brendan

Keith Wiley     kwiley@keithwiley.com     keithwiley.com    music.keithwiley.com

"You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can
itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't
scratch. All together this implies: He scratched the itch from the scratch that
itched but would never itch the scratch from the itch that scratched."
                                           --  Keith Wiley
