hadoop-common-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: how to improve the Hadoop's capability of dealing with small files
Date Thu, 07 May 2009 14:52:37 GMT
2009/5/7 Jeff Hammerbacher <hammer@cloudera.com>:
> Hey,
>
> You can read more about why small files are difficult for HDFS at
> http://www.cloudera.com/blog/2009/02/02/the-small-files-problem.
>
> Regards,
> Jeff
>
> 2009/5/7 Piotr Praczyk <piotr.praczyk@gmail.com>
>
>> If you want to use many small files, they probably have the same
>> purpose and structure?
>> Why not use HBase instead of raw HDFS? Many small files would be
>> packed together and the problem would disappear.
>>
>> cheers
>> Piotr
>>
>> 2009/5/7 Jonathan Cao <jonathanc@rockyou.com>
>>
>> > There are at least two design choices in Hadoop that have
>> > implications for your scenario.
>> > 1. All the HDFS metadata is stored in namenode memory -- the memory
>> > size is one limitation on how many "small" files you can have.
>> >
>> > 2. The efficiency of the map/reduce paradigm dictates that each
>> > mapper/reducer job has enough work to offset the overhead of
>> > spawning the job. It relies on each task reading a contiguous chunk
>> > of data (typically 64MB); your small-file situation will change
>> > those efficient sequential reads into a larger number of
>> > inefficient random reads.
>> >
>> > Of course, small is a relative term.
>> >
>> > Jonathan
>> >
>> > 2009/5/6 陈桂芬 <chenguifen_hz@163.com>
>> >
>> > > Hi:
>> > >
>> > > In my application, there are many small files, but Hadoop is
>> > > designed to deal with large files.
>> > >
>> > > I want to know why Hadoop doesn't support small files very well
>> > > and where the bottleneck is, and what I can do to improve
>> > > Hadoop's capability of dealing with small files.
>> > >
>> > > Thanks.
>> > >
>> > >
>> >
>>
>
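
To make Piotr's HBase suggestion above concrete, here is a minimal
sketch of packing small files into an HBase table, one row per file.
It uses a newer HBase client API than existed at the time of this
thread, and the table name "small_files", the column family "f", the
row-key scheme, and the source directory are all assumptions:

    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PackIntoHBase {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            byte[] family = Bytes.toBytes("f");
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("small_files"))) {
                // One row per small file: row key = file name, cell = file bytes.
                // HBase folds the cells into large HFiles, so HDFS only ever
                // sees a handful of big files instead of thousands of tiny ones.
                // Assumes the local staging directory exists and is readable.
                for (File f : new File("/data/incoming").listFiles()) {
                    byte[] content = Files.readAllBytes(f.toPath());
                    Put put = new Put(Bytes.toBytes(f.getName()));
                    put.addColumn(family, Bytes.toBytes("content"), content);
                    table.put(put);
                }
            }
        }
    }
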
When the small file problem comes up, most of the talk centers around
the inode table being held in namenode memory. The Cloudera blog
points out something else:

Furthermore, HDFS is not geared up to efficiently accessing small
files: it is primarily designed for streaming access of large files.
Reading through small files normally causes lots of seeks and lots of
hopping from datanode to datanode to retrieve each small file, all of
which is an inefficient data access pattern.
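
To put the in-memory metadata point in rough numbers, here is a
back-of-the-envelope sketch; the ~150 bytes per namespace object is
the rule of thumb from the Cloudera post, and the file count is an
assumed workload:

    public class NameNodeHeapEstimate {
        public static void main(String[] args) {
            // Rule of thumb (Cloudera post): every file, directory and block is
            // held as an object in the namenode's heap, roughly 150 bytes each.
            final long bytesPerObject = 150;
            final long numSmallFiles = 10_000_000L; // assumed: ten million files

            // A file smaller than one block still costs a file object plus a
            // block object.
            long heapBytes = numSmallFiles * 2 * bytesPerObject;

            System.out.printf("~%.1f GB of namenode heap just for metadata on %,d small files%n",
                    heapBytes / 1e9, numSmallFiles);
        }
    }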

My application attempted to load 9,000 6 KB files using a
single-threaded program and FSDataOutputStream objects to write each
one directly to HDFS. My plan was to have Hadoop merge these files in
the next step. I had to abandon this plan because the process was
taking hours. I knew HDFS had a "small file problem", but I never
realized that I could not approach the problem the 'old fashioned
way'. Instead, I merged the files locally, and uploading the
resulting handful of files gave great throughput.
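
A sketch of that kind of per-file write loop, the approach that was
abandoned (not actual code from this thread; the target directory and
the dummy 6 KB payload are stand-ins):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteSmallFiles {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            byte[] payload = new byte[6 * 1024]; // stand-in for a real 6 KB record

            // One create() per tiny file: each call is a namenode round trip
            // plus a new write pipeline to the datanodes, so the fixed per-file
            // cost dominates the few KB actually written.
            for (int i = 0; i < 9000; i++) {
                Path p = new Path("/tmp/small/" + i + ".dat");
                try (FSDataOutputStream out = fs.create(p)) {
                    out.write(payload);
                }
            }
        }
    }
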
Small files are not just a permanent storage issue; they are a
serious optimization problem as well.
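
One standard way to get the same effect in a single pass (not
necessarily what was done in this case) is to pack the small files
into one SequenceFile keyed by file name, so the namenode sees a
single large file instead of thousands of tiny ones. A minimal
sketch, using the newer option-based SequenceFile.Writer API; the
input directory and output path are assumptions:

    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackIntoSequenceFile {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            Path out = new Path("/data/packed.seq"); // single large output file on HDFS

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(out),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                // key = original file name, value = raw file contents.
                // Assumes the local source directory exists and is readable.
                for (File f : new File("/data/incoming").listFiles()) {
                    byte[] content = Files.readAllBytes(f.toPath());
                    writer.append(new Text(f.getName()), new BytesWritable(content));
                }
            }
        }
    }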
