hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: Best practices for handling many small files
Date Thu, 24 Apr 2008 16:14:15 GMT

Stuart,

Packing into compressed SequenceFiles will also have the (slightly)
desirable side effect of making your total disk footprint smaller.  I
don't suppose that matters all that much anymore, but it is still a nice
thought.


On 4/24/08 8:28 AM, "Stuart Sierra" <mail@stuartsierra.com> wrote:

> Thanks for the advice, everyone.  I'm going to go with #2, packing my
> million files into a small number of SequenceFiles.  This is slow, but
> only has to be done once.  My "datacenter" is Amazon Web Services :),
> so storing a few large, compressed files is the easiest way to go.
> 
> My code, if anyone's interested, is here:
> http://stuartsierra.com/2008/04/24/a-million-little-files
> 
> -Stuart
> altlaw.org
> 
> 
> On Wed, Apr 23, 2008 at 11:55 AM, Stuart Sierra <mail@stuartsierra.com> wrote:
>> Hello all, Hadoop newbie here, asking: what's the preferred way to
>>  handle large (~1 million) collections of small files (10 to 100 KB) in
>>  which each file is a single "record"?
>> 
>>  1. Ignore it, let Hadoop create a million Map processes;
>>  2. Pack all the files into a single SequenceFile; or
>>  3. Something else?
>> 
>>  I started writing code to do #2, transforming a big tar.bz2 into a
>>  BLOCK-compressed SequenceFile, with the file names as keys.  Will that
>>  work?
>> 
>>  Thanks,
>>  -Stuart, altlaw.org
>> 
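
For reference, here is a minimal sketch of the packing step Stuart
describes in option #2: each small file becomes one record in a
BLOCK-compressed SequenceFile, keyed by file name.  It reads from a
local directory rather than a tar.bz2 (Stuart's linked code handles the
tar case), and the class name and argument handling are illustrative
only, not his actual implementation.

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    // Usage: SmallFilePacker <local input dir> <output SequenceFile path>
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // BLOCK compression batches many records into each compressed
        // block, which is what shrinks the total disk footprint.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path(args[1]),
                Text.class, BytesWritable.class,
                SequenceFile.CompressionType.BLOCK);
        try {
            // Assumes args[0] is an existing directory of small files.
            for (File f : new File(args[0]).listFiles()) {
                byte[] contents = new byte[(int) f.length()];
                DataInputStream in =
                        new DataInputStream(new FileInputStream(f));
                try {
                    in.readFully(contents);
                } finally {
                    in.close();
                }
                // One record per file: file name as key, raw bytes as value.
                writer.append(new Text(f.getName()),
                              new BytesWritable(contents));
            }
        } finally {
            writer.close();
        }
    }
}

A job can then read the packed file with SequenceFileInputFormat, so
each map() call receives one original file as a single (name, bytes)
record, instead of Hadoop launching a separate map task per file as in
option #1.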

