hadoop-common-user mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Best practices for handling many small files
Date Mon, 28 Apr 2008 16:12:54 GMT
Joydeep Sen Sarma wrote:
> There seem to be two problems with small files:
> 1. namenode overhead. (3307 seems like _a_ solution)
> 2. map-reduce processing overhead and locality
> It's not clear from the 3307 description how the archives interface with
> map-reduce. How are the splits done? Will they solve problem #2?

Yes, I think 3307 will address (2).  Many small files will be packed 
into fewer, larger files, each typically much larger than a block.  A 
splitter can read the index files and then use MultiFileInputFormat, so 
that each split contains files that lie almost entirely within a single 
block.
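
To make the grouping idea concrete, here is a rough sketch in plain Java with no Hadoop dependency. It is not the 3307 implementation and does not use the real MultiFileInputFormat API. The Entry class, the index layout (name, offset, length) and the method names are all assumptions made up for illustration. Each small file is assigned to the block that holds most of its bytes, and each resulting group would become one split:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class BlockGroupingSketch {
    /** Hypothetical index entry: where one small file sits inside the packed file. */
    static final class Entry {
        final String name;
        final long offset;
        final long length;

        Entry(String name, long offset, long length) {
            this.name = name;
            this.offset = offset;
            this.length = length;
        }
    }

    /** Returns the index of the block that holds the largest share of the entry's bytes. */
    static long dominantBlock(Entry e, long blockSize) {
        long first = e.offset / blockSize;
        long last = e.length == 0 ? first : (e.offset + e.length - 1) / blockSize;
        long best = first;
        long bestOverlap = -1;
        for (long b = first; b <= last; b++) {
            long start = Math.max(e.offset, b * blockSize);
            long end = Math.min(e.offset + e.length, (b + 1) * blockSize);
            long overlap = end - start;
            if (overlap > bestOverlap) { // ties go to the earlier block
                bestOverlap = overlap;
                best = b;
            }
        }
        return best;
    }

    /** Groups file names by dominant block; each group is a candidate split. */
    static Map<Long, List<String>> groupByBlock(List<Entry> index, long blockSize) {
        Map<Long, List<String>> groups = new TreeMap<>();
        for (Entry e : index) {
            groups.computeIfAbsent(dominantBlock(e, blockSize), k -> new ArrayList<>())
                  .add(e.name);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Toy numbers: a 100-byte "block" and four small files packed back to back.
        List<Entry> index = new ArrayList<>();
        index.add(new Entry("a", 0, 40));
        index.add(new Entry("b", 40, 50));
        index.add(new Entry("c", 90, 30));   // straddles blocks 0 and 1, mostly in 1
        index.add(new Entry("d", 120, 60));
        System.out.println(groupByBlock(index, 100)); // {0=[a, b], 1=[c, d]}
    }
}
```

With groups like these, the task for each split could be scheduled on a node that holds the corresponding block, so most reads stay local even though a file such as "c" crosses a block boundary.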

Good MapReduce performance is a requirement for the design of 3307.

