hadoop-common-user mailing list archives

From Mark Kerzner <markkerz...@gmail.com>
Subject Re: HDFS - millions of files in one directory?
Date Mon, 26 Jan 2009 04:40:21 GMT
Brian,

all the replies point in the direction of combining all the files into one. I
have a few stages of processing, but each one is sequential, so I can create
one large file after another, and the performance will be the best it can be,
with no deterioration caused by artificial limitations of my own.

I had planned to have a little descriptor file next to each actual one, but I
can just as easily write the descriptor right after the actual file.
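
As a rough illustration, here is a minimal sketch of that packing step,
assuming the files go into a SequenceFile keyed by file name, with the raw
bytes as BytesWritable values and a second record per file holding the
descriptor (the ".desc" key suffix and the helper methods are illustrative
names, nothing settled):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {

  // Pack many small binary files into one SequenceFile, writing a small
  // descriptor record immediately after each file's bytes.
  public static void pack(Configuration conf, Path[] inputs, Path output)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, output, Text.class, BytesWritable.class);
    try {
      for (Path in : inputs) {
        byte[] data = readAll(fs, in);
        // the actual file, keyed by its name
        writer.append(new Text(in.getName()), new BytesWritable(data));
        // the descriptor, written right after the actual file
        byte[] desc = describe(in, data);
        writer.append(new Text(in.getName() + ".desc"), new BytesWritable(desc));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }

  private static byte[] readAll(FileSystem fs, Path p) throws IOException {
    int len = (int) fs.getFileStatus(p).getLen();  // fine for small files
    byte[] buf = new byte[len];
    FSDataInputStream in = fs.open(p);
    try {
      IOUtils.readFully(in, buf, 0, len);
    } finally {
      IOUtils.closeStream(in);
    }
    return buf;
  }

  private static byte[] describe(Path p, byte[] data) {
    // placeholder: whatever per-file metadata the later stages need
    return (p.getName() + "\t" + data.length).getBytes();
  }
}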

Thank you,
Mark

On Sun, Jan 25, 2009 at 9:57 PM, Brian Bockelman <bbockelm@cse.unl.edu> wrote:

> Hey Mark,
>
> You'll want to watch your name node requirements -- tossing a wild guess
> out there, a billion files could mean that you need on the order of
> terabytes of RAM in your namenode.
>
> Have you considered:
> a) using SequenceFile (appropriate for binary data, I believe -- but it
>    limits you to sequential I/O), or
> b) looking into the ARC file format, which someone referenced previously
>    on this list?
>
> Brian
>
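
For option (a), a companion sketch of reading such a packed file back:
SequenceFile.Reader only walks forward through the records, which is the
sequential-I/O limitation mentioned above (class and helper names are, again,
only illustrative):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackedFileScanner {

  // Scan every record of a packed SequenceFile, strictly front to back.
  public static void scan(Configuration conf, Path packed) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, packed, conf);
    try {
      Text key = new Text();
      BytesWritable value = new BytesWritable();
      while (reader.next(key, value)) {  // forward-only: no random lookup by name
        process(key.toString(), value.getBytes(), value.getLength());
      }
    } finally {
      IOUtils.closeStream(reader);
    }
  }

  private static void process(String name, byte[] bytes, int length) {
    // placeholder for the per-file processing stage
  }
}
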
> On Jan 25, 2009, at 8:29 PM, Mark Kerzner wrote:
>
>> Thank you, Jason, this is awesome information. I am going to use a
>> balanced directory tree structure, and I am going to make it independent
>> of the other parts of the system, so that I can change it later should
>> practice dictate that I do so.
>>
>> Mark
>>
>> On Sun, Jan 25, 2009 at 8:06 PM, jason hadoop <jason.hadoop@gmail.com>
>> wrote:
>>
>>> With large numbers of files you run the risk of the Datanodes timing
>>> out when they are performing their block reports and/or DU reports.
>>> Basically, if a *find* in the dfs.data.dir takes more than 10 minutes,
>>> you will have catastrophic problems with your HDFS.
>>> At Attributor, with 2 million blocks on a datanode under XFS on stock
>>> CentOS 5.1 (i686) kernels, that scan would take 21 minutes with noatime,
>>> on a 6-disk RAID 5 array (8-way 2.5 GHz Xeons, 8 GB RAM; the RAID
>>> controller was a PERC and the machine basically served HDFS).
>>>
>>> On Sun, Jan 25, 2009 at 1:49 PM, Mark Kerzner <markkerzner@gmail.com>
>>> wrote:
>>>
>>>> Yes, flip suggested such a solution, but his files are text, so he
>>>> could combine them all into one large text file, with each line
>>>> representing one of the initial files. My files, however, are binary,
>>>> so I do not see how I could combine them.
>>>>
>>>> However, since my numbers are limited to about 1 billion files total,
>>>> I should be OK putting them all in a few directories with under, say,
>>>> 10,000 files each. Maybe a little balanced tree, but 3-4 levels should
>>>> suffice.
>>>>
>>>> Thank you,
>>>> Mark
>>>>
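
A minimal sketch of one such balanced layout, assuming the directory path is
derived from a hash of the file name; the depth and fan-out below are
illustrative, chosen so that a billion files still leaves each leaf directory
well under the 10,000-file mark:

import org.apache.hadoop.fs.Path;

public class BalancedLayout {

  private static final int LEVELS = 3;    // directory depth
  private static final int FANOUT = 256;  // subdirectories per level

  // Map a file name to a path like root/1a/07/c3/filename, spreading files
  // evenly over 256^3 (~16.7 million) leaf directories.
  public static Path pathFor(Path root, String fileName) {
    int h = fileName.hashCode();
    Path dir = root;
    for (int level = 0; level < LEVELS; level++) {
      int bucket = Math.abs(h % FANOUT);
      dir = new Path(dir, String.format("%02x", bucket));
      h /= FANOUT;
    }
    return new Path(dir, fileName);
  }
}
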
>>>> On Sun, Jan 25, 2009 at 11:43 AM, Carfield Yim <carfield@carfield.com.hk>
>>>> wrote:
>>>>
>>>>> Would it be possible simply to have one large file instead of a lot
>>>>> of small files?
>>>>>
>>>>> On Sat, Jan 24, 2009 at 7:03 AM, Mark Kerzner <markkerzner@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> There is a performance penalty in Windows (pardon the expression) if
>>>>>> you put too many files in the same directory. The OS becomes very
>>>>>> slow, stops seeing them, and lies about their status to my Java
>>>>>> requests. I do not know if this is also a problem in Linux, but in
>>>>>> HDFS -- do I need to balance a directory tree if I want to store
>>>>>> millions of files, or can I put them all in the same directory?
>>>>>>
>>>>>> Thank you,
>>>>>> Mark
