hadoop-common-user mailing list archives

From Ted Dunning <tdunn...@veoh.com>
Subject Re: File size and number of files considerations
Date Mon, 10 Mar 2008 15:37:16 GMT

Amar's comments are a little strange.

Replication occurs at the block level, not the file level.  Storing data in
a small number of large files or a large number of small files will have
less than a factor of two effect on the number of replicated blocks, as long
as the files are >64 MB.  Files smaller than that will hurt performance due
to seek costs.
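In case it helps, here is a rough sketch of how the block size can be read
from the configuration or requested explicitly when a file is created (the
path, the 128 MB figure, and the replication factor are made-up examples,
not anything specific to Naama's setup):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Cluster-wide default block size (dfs.block.size), 64 MB out of the box.
    long defaultBlockSize = conf.getLong("dfs.block.size", 64L * 1024 * 1024);
    System.out.println("Default block size: " + defaultBlockSize);

    // A file can also be created with an explicit block size (here 128 MB),
    // so a large consolidated file is split into correspondingly large blocks.
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(
        new Path("/data/example.txt"),              // hypothetical path
        true,                                       // overwrite if it exists
        conf.getInt("io.file.buffer.size", 4096),   // write buffer size
        (short) 3,                                  // replication factor
        128L * 1024 * 1024);                        // block size in bytes
    out.writeBytes("one record per line\n");
    out.close();
  }
}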

To address Naama's question, you should consolidate your files so that you
have files of at least 64 MB, and preferably a bit larger than that.  This
helps because it allows the reading of the files to proceed in a nice
sequential manner, which can greatly increase throughput.

If consolidating these files off-line is difficult, it is easy to do in a
preliminary map-reduce step.  This will incur a one-time cost, but if you
are doing multiple passes over the data later, it will be worth it.
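
For what it's worth, a minimal sketch of such a consolidation pass is below,
using the plain text formats.  The input and output paths and the reducer
count are made-up examples, and the records come out of the reduce sorted
rather than in their original order, which is usually acceptable for this
kind of repacking.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ConsolidateFiles {

  // Pass each line through unchanged, dropping the byte-offset key so the
  // output files contain only the original records.
  public static class LineMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, NullWritable> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, NullWritable> output,
                    Reporter reporter) throws IOException {
      output.collect(line, NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ConsolidateFiles.class);
    conf.setJobName("consolidate-small-files");

    conf.setMapperClass(LineMapper.class);
    conf.setReducerClass(IdentityReducer.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(NullWritable.class);

    // Each reducer writes a single output file, so a small reducer count
    // turns many small input files into a few large ones.
    conf.setNumReduceTasks(4);

    FileInputFormat.setInputPaths(conf, new Path("/data/incoming-small-files"));
    FileOutputFormat.setOutputPath(conf, new Path("/data/consolidated"));

    JobClient.runJob(conf);
  }
}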


On 3/10/08 3:12 AM, "Amar Kamat" <amarrk@yahoo-inc.com> wrote:

> On Mon, 10 Mar 2008, Naama Kraus wrote:
> 
>> Hi,
>> 
>> In our system, we plan to upload data into Hadoop from external sources and
>> use it later on for analysis tasks. The interface to the external
>> repositories allows us to fetch pieces of data in chunks, e.g. get n records
>> at a time. Records are relatively small, though the overall amount of data
>> is assumed to be large. For each repository, we fetch pieces of data in a
>> serial manner. The number of repositories is small (just a few).
>> 
>> My first step is to put the data into plain files in HDFS. My question is
>> what file sizes are optimal to use. Many small files (to the extreme of one
>> record per file)? I guess not. A few huge files, each holding all the data
>> of the same type? Or maybe put each chunk we get in a separate file, and
>> close it right after the chunk has been uploaded?
>> 
> I think it should be based more on the size of the data you want to
> process in a map, which I think here is the chunk size, no?
> The larger the file, the fewer the replicas, and hence the more network
> transfers in the case of more maps. With smaller files the NN will be the
> bottleneck, but you will end up having more replicas for each map task and
> hence more locality.
> Amar
>> How would HDFS perform best, with a few large files or many smaller files? As
>> I wrote, we plan to run MapReduce jobs over the data in the files in order to
>> organize the data and analyze it.
>> 
>> Thanks for any help,
>> Naama
>> 
>> 

