hadoop-common-user mailing list archives

From Konstantin Shvachko <...@yahoo-inc.com>
Subject Re: Best practices for handling many small files
Date Fri, 25 Apr 2008 17:45:53 GMT
Would the new archive feature (HADOOP-3307) that is currently being developed help with this problem?
http://issues.apache.org/jira/browse/HADOOP-3307

--Konstantin

Subramaniam Krishnan wrote:
> 
> We have actually written a custom Multi File Splitter that collapses all 
> the small files into a single split until the DFS block size is reached.
> We also handle big files by splitting them on the block size and 
> combining all the remainders (if any) into a single split.
> 
> It works great for us....:-)
> We are working on optimizing it further to group the small files that 
> reside on the same data node, so that each Map task reads as much 
> local data as possible.
> 
> We plan to share this (provided it's found acceptable, of course) once 
> this is done.
> 
> Regards,
> Subru
> 
> Stuart Sierra wrote:
> 
>> Thanks for the advice, everyone.  I'm going to go with #2, packing my
>> million files into a small number of SequenceFiles.  This is slow, but
>> only has to be done once.  My "datacenter" is Amazon Web Services :),
>> so storing a few large, compressed files is the easiest way to go.
>>
>> My code, if anyone's interested, is here:
>> http://stuartsierra.com/2008/04/24/a-million-little-files
>>
>> -Stuart
>> altlaw.org
>>
>>
>> On Wed, Apr 23, 2008 at 11:55 AM, Stuart Sierra 
>> <mail@stuartsierra.com> wrote:
>>  
>>
>>> Hello all, Hadoop newbie here, asking: what's the preferred way to
>>>  handle large (~1 million) collections of small files (10 to 100KB) in
>>>  which each file is a single "record"?
>>>
>>>  1. Ignore it, let Hadoop create a million Map processes;
>>>  2. Pack all the files into a single SequenceFile; or
>>>  3. Something else?
>>>
>>>  I started writing code to do #2, transforming a big tar.bz2 into a
>>>  BLOCK-compressed SequenceFile, with the file names as keys.  Will that
>>>  work?
>>>
>>>  Thanks,
>>>  -Stuart, altlaw.org
>>>
>>>     
> 
> 
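
A minimal sketch of the packing idea Subru describes above, in plain Java with no Hadoop dependencies. It is not the splitter from the thread, which was not posted: the names SplitPacker, Chunk and pack are hypothetical, and packing the remainders of big files together with the small files under the same block-size cap is only one reading of the description. Data-node locality is ignored.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; not the splitter discussed in the thread. */
class SplitPacker {

    /** A byte range of one file: [offset, offset + length). */
    static final class Chunk {
        final String path;
        final long offset;
        final long length;

        Chunk(String path, long offset, long length) {
            this.path = path;
            this.offset = offset;
            this.length = length;
        }

        @Override
        public String toString() {
            return path + "@" + offset + "+" + length;
        }
    }

    /**
     * Groups file chunks into splits. Every full block of a big file becomes
     * its own split; whole small files and the tails of big files are packed
     * greedily, in input order, so that no packed split exceeds blockSize.
     */
    static List<List<Chunk>> pack(Map<String, Long> fileSizes, long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<List<Chunk>> splits = new ArrayList<>();
        List<Chunk> small = new ArrayList<>(); // pieces shorter than one block

        for (Map.Entry<String, Long> e : fileSizes.entrySet()) {
            String path = e.getKey();
            long size = e.getValue();
            long offset = 0;
            // Carve full blocks off the front of a big file.
            while (size - offset >= blockSize) {
                List<Chunk> one = new ArrayList<>();
                one.add(new Chunk(path, offset, blockSize));
                splits.add(one);
                offset += blockSize;
            }
            // What is left is a whole small file or the remainder of a big one.
            if (size - offset > 0) {
                small.add(new Chunk(path, offset, size - offset));
            }
        }

        // Greedy packing: start a new split when the next piece would overflow.
        List<Chunk> current = new ArrayList<>();
        long currentBytes = 0;
        for (Chunk c : small) {
            if (!current.isEmpty() && currentBytes + c.length > blockSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(c);
            currentBytes += c.length;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        Map<String, Long> files = new LinkedHashMap<>();
        files.put("a", 40L);
        files.put("b", 50L);
        files.put("big", 250L);
        files.put("c", 30L);
        for (List<Chunk> split : pack(files, 100L)) {
            System.out.println(split);
        }
    }
}
```

With a block size of 100 and files of 40, 50, 250 and 30 bytes, the sketch produces four splits: two full blocks of the big file, one split holding the two smallest leading files, and one holding the big file's 50-byte remainder plus the last file. In a real job these groups would still have to be wrapped in a custom InputFormat and InputSplit, which is omitted here.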
