hadoop-common-user mailing list archives

From Brian Bockelman <bbock...@cse.unl.edu>
Subject Re: HDFS - millions of files in one directory?
Date Mon, 26 Jan 2009 03:57:05 GMT
Hey Mark,

You'll want to watch your namenode requirements -- tossing a wild guess
out there, a billion files could mean that you need on the order of
terabytes of RAM in your namenode.
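
A rough cross-check on that guess, using the commonly cited rule of thumb
(the ~150 bytes/object figure is not from this thread): every file and
every block is a namespace object held in namenode heap at roughly 150
bytes apiece, so a billion files with at least one block each comes to

    2,000,000,000 objects x 150 bytes/object ~= 300 GB

of heap before any overhead -- the same order of magnitude as the wild
guess above once replica metadata and headroom are added.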

Have you considered:
a) Using SequenceFile (appropriate for binary data, I believe -- but it
limits you to sequential I/O)
b) Looking into the ARC file format, which someone referenced previously
on this list?
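
For option (a), a minimal sketch of what packing small binary files into
a single SequenceFile could look like -- the output path and class name
are illustrative, not from the thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // Pack many small binary files into one SequenceFile:
    // key = original file name, value = the file's raw bytes.
    public class SmallFilePacker {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path("/user/mark/packed.seq"); // illustrative path

        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, Text.class, BytesWritable.class);
        try {
          for (String name : args) { // local files to pack
            byte[] data = java.nio.file.Files.readAllBytes(
                java.nio.file.Paths.get(name));
            writer.append(new Text(name), new BytesWritable(data));
          }
        } finally {
          writer.close();
        }
      }
    }

Getting a record back means scanning the file in order, which is the
sequential-I/O limitation noted above.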

Brian

On Jan 25, 2009, at 8:29 PM, Mark Kerzner wrote:

> Thank you, Jason, this is awesome information. I am going to use a
> balanced directory tree structure, and I am going to make this
> independent of the other parts of the system, so that I can change it
> later should practice dictate.
>
> Mark
>
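Mark's idea of keeping the layout independent of the rest of the system
could look something like this sketch -- the interface name, paths, and
fan-out are all made up for illustration, and a hashed tree is just one
possible implementation behind it:

    import org.apache.hadoop.fs.Path;

    // Callers ask "where does this file live?" through one small interface,
    // so the on-disk layout can be swapped later without touching them.
    interface FilePlacement {
      Path locate(String fileName);
    }

    // One implementation: three hashed levels of 256 directories each,
    // i.e. 256^3 ~= 16.8M leaf directories -- a billion files lands at
    // roughly 60 files per leaf directory.
    class HashedTreePlacement implements FilePlacement {
      public Path locate(String fileName) {
        int h = fileName.hashCode();
        return new Path(String.format("/files/%02x/%02x/%02x/%s",
            (h >>> 16) & 0xFF, (h >>> 8) & 0xFF, h & 0xFF, fileName));
      }
    }
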
> On Sun, Jan 25, 2009 at 8:06 PM, jason hadoop <jason.hadoop@gmail.com> wrote:
>
>> With large numbers of files you run the risk of the datanodes timing
>> out when they are performing their block reports and/or DU reports.
>> Basically, if a *find* in the dfs.data.dir takes more than 10 minutes,
>> you will have catastrophic problems with your HDFS.
>> At Attributor, with 2 million blocks on a datanode under XFS on stock
>> CentOS 5.1 (i686) kernels, the find would take 21 minutes with noatime,
>> on a 6-disk RAID 5 array (8-way 2.5 GHz Xeons, 8 GB RAM). The RAID
>> controller was a PERC and the machine basically served HDFS.
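
One rough way to see how close a datanode is to that 10-minute cliff is
to time a recursive walk of the data directory, which is essentially the
disk scan a block report performs. A minimal sketch (the default path
below is illustrative -- point it at your actual dfs.data.dir):

    import java.io.File;

    // Recursively count files under a datanode data directory and time it,
    // approximating the directory scan behind a block report.
    public class BlockScanTimer {
      static long count(File dir) {
        long n = 0;
        File[] entries = dir.listFiles();
        if (entries == null) return 0;          // unreadable or not a directory
        for (File f : entries) {
          n += f.isDirectory() ? count(f) : 1;  // count block and meta files
        }
        return n;
      }

      public static void main(String[] args) {
        File dataDir = new File(args.length > 0 ? args[0] : "/data/dfs/data");
        long start = System.currentTimeMillis();
        long files = count(dataDir);
        long secs = (System.currentTimeMillis() - start) / 1000;
        System.out.println(files + " files scanned in " + secs + " s");
        // Anywhere near 600 s and the datanode is at risk of timing out.
      }
    }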
>>
>>
>> On Sun, Jan 25, 2009 at 1:49 PM, Mark Kerzner <markkerzner@gmail.com>
>> wrote:
>>
>>> Yes, flip suggested such a solution, but his files are text, so he
>>> could combine them all into one large text file, with each line
>>> representing one of the initial files. My files, however, are binary,
>>> so I do not see how I could combine them.
>>>
>>> However, since my numbers are limited to about 1 billion files total,
>>> I should be OK putting them all in a few directories with under, say,
>>> 10,000 files each. Maybe a little balanced tree, but 3-4 levels should
>>> suffice.
>>>
>>> Thank you,
>>> Mark
>>>
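For a sense of the arithmetic behind that plan (fan-out numbers here are
illustrative, not from the thread): with 100 subdirectories at each of 3
levels you get 100^3 = 1,000,000 leaf directories, so a billion files
comes to about 1,000 per leaf -- comfortably under the 10,000 ceiling.
Adding a 4th level drops that to roughly 10 per leaf.
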
>>> On Sun, Jan 25, 2009 at 11:43 AM, Carfield Yim <carfield@carfield.com.hk> wrote:
>>>
>>>> Is it possible to simply have one large file instead of a lot of
>>>> small files?
>>>>
>>>> On Sat, Jan 24, 2009 at 7:03 AM, Mark Kerzner <markkerzner@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> there is a performance penalty in Windows (pardon the expression) if
>>>>> you put too many files in the same directory. The OS becomes very
>>>>> slow, stops seeing them, and lies about their status to my Java
>>>>> requests. I do not know if this is also a problem in Linux, but in
>>>>> HDFS -- do I need to balance a directory tree if I want to store
>>>>> millions of files, or can I put them all in the same directory?
>>>>>
>>>>> Thank you,
>>>>> Mark
>>>>
>>>
>>

