hadoop-common-user mailing list archives

From Sagar Naik <sn...@attributor.com>
Subject Re: HDFS - millions of files in one directory?
Date Wed, 28 Jan 2009 07:29:32 GMT

Consider a system with 1 billion small files.
The namenode will need to maintain the data structures for all of those files.
The system will have at least 1 block per file, and if you have the replication 
factor set to 3, the system will have 3 billion block replicas.
Now, if you try to read all these files in a job, you will be making 
as many as 1 billion socket connections to get these blocks. (Big 
Brothers, correct me if I'm wrong.)
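
To put very rough numbers on the namenode side (a back-of-the-envelope 
sketch only; the ~150 bytes per namespace object is a commonly quoted 
rule of thumb, not a measured figure):

public class NamenodeHeapEstimate {
    public static void main(String[] args) {
        long files  = 1000L * 1000 * 1000;   // 1 billion small files
        long blocks = files;                 // at least one block per file
        long nnObjects = files + blocks;     // inode + block objects, all held in namenode RAM
        long heapBytes = nnObjects * 150L;   // assumed ~150 bytes per object (rule of thumb only)
        System.out.println("~" + (heapBytes >> 30) + " GB of namenode heap");
    }
}

With that assumption, the namespace alone would want a few hundred GB of 
namenode heap before any data is even read.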

Datanodes routinely check for available disk space and collect block 
reports. These operations are directly dependent on the number of blocks 
on a datanode.

Getting all the data into one file avoids all this unnecessary IO and 
the memory occupied on the namenode.
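
For binary files, one common trick is a SequenceFile keyed by the 
original file name, so no delimiter is needed at all. A rough, untested 
sketch (class name and paths are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path in = new Path(args[0]);    // directory full of small binary files
        Path out = new Path(args[1]);   // single packed SequenceFile
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, BytesWritable.class);
        try {
            for (FileStatus stat : fs.listStatus(in)) {
                byte[] buf = new byte[(int) stat.getLen()];
                FSDataInputStream is = fs.open(stat.getPath());
                try {
                    is.readFully(0, buf);
                } finally {
                    is.close();
                }
                // key = original file name, value = raw bytes; no delimiter needed
                writer.append(new Text(stat.getPath().getName()),
                              new BytesWritable(buf));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}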

The number of maps in a map-reduce job is based on the number of blocks. 
With a huge number of small files, we will have a large number of map tasks.
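
Once everything is packed into one big SequenceFile, the input format 
decides the splits, so the map count follows the HDFS blocks of that one 
file rather than one map per tiny file. A sketch with the old mapred API 
(job and class names are invented for the example; the mapper is left to 
you):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class PackedFilesJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(PackedFilesJob.class);
        conf.setJobName("process-packed-files");
        // Reads <Text, BytesWritable> records out of the packed SequenceFile;
        // splits (and hence map tasks) follow the file's HDFS blocks.
        conf.setInputFormat(SequenceFileInputFormat.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(BytesWritable.class);
        // conf.setMapperClass(YourMapper.class);  // plug in your own mapper here
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}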

-Sagar


Mark Kerzner wrote:
> Carfield,
>
> you might be right, and I may be able to combine them in one large file.
> What would one use for a delimiter, so that it would never be encountered in
> normal binary files? Performance does matter (rarely it doesn't). What are
> the differences in performance between using multiple files and one large
> file? I would guess that one file should in fact give better hardware/OS
> performance, because it is more predictable and allows buffering.
>
> thank you,
> Mark
>
> On Sun, Jan 25, 2009 at 9:50 PM, Carfield Yim <carfield@carfield.com.hk> wrote:
>
>   
>> Really? I thought any files can be combined as long as you can figure
>> out a delimiter, and can you really not find some delimiter, like
>> "XXXXXXXXX"? And in the worst case, or if performance is not really a
>> concern, maybe just encode all the binary to and from ASCII?
>>
>> On Mon, Jan 26, 2009 at 5:49 AM, Mark Kerzner <markkerzner@gmail.com>
>> wrote:
>>> Yes, flip suggested such a solution, but his files are text, so he could
>>> combine them all in one large text file, with each line representing one
>>> of the initial files. My files, however, are binary, so I do not see how
>>> I could combine them.
>>>
>>> However, since my numbers are limited to about 1 billion files total, I
>>> should be OK to put them all in a few directories with under, say, 10,000
>>> files each. Maybe a little balanced tree, but 3-4 levels should suffice.
>>>
>>> Thank you,
>>> Mark
>>>
>>> On Sun, Jan 25, 2009 at 11:43 AM, Carfield Yim <carfield@carfield.com.hk>
>>> wrote:
>>>
>>>> Possibly just having one large file instead of having a lot of small
>>>> files?
>>>>
>>>> On Sat, Jan 24, 2009 at 7:03 AM, Mark Kerzner <markkerzner@gmail.com>
>>>> wrote:
>>>>> Hi,
>>>>>
>>>>> there is a performance penalty in Windows (pardon the expression) if you
>>>>> put too many files in the same directory. The OS becomes very slow, stops
>>>>> seeing them, and lies about their status to my Java requests. I do not
>>>>> know if this is also a problem in Linux, but in HDFS - do I need to
>>>>> balance a directory tree if I want to store millions of files, or can I
>>>>> put them all in the same directory?
>>>>>
>>>>> Thank you,
>>>>> Mark
>
>   
