hadoop-common-dev mailing list archives

From Eric Baldeschwieler <eri...@yahoo-inc.com>
Subject Re: [jira] Created: (HADOOP-713) dfs list operation is too expensive
Date Wed, 15 Nov 2006 21:11:23 GMT
It is not free.  As I understand it, we are recursively walking the  
namespace tree with every ls to get this.

This is not a scalable design.  Even POSIX doesn't do this!

This is a performance problem that will only get worse.  I suggest  
removing this performance mistake and documenting the existence of  
dfs -du, which is a rather familiar solution to most users.
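
For concreteness, the pattern I'm objecting to is roughly the
following (a hypothetical sketch -- class and field names here are
illustrative assumptions, not the actual namenode code):

  // Hypothetical sketch of the O(subtree) work done per listing.
  class INode {
    long fileLength;                    // meaningful for files
    java.util.List<INode> children;     // non-null for directories

    boolean isDir() { return children != null; }

    // Called for every directory entry returned by ls, so a single
    // listing can touch every node under each entry -- all while
    // the namespace lock is held.
    long computeContentsLen() {
      if (!isDir()) return fileLength;
      long total = 0;
      for (INode child : children) {
        total += child.computeContentsLen();
      }
      return total;
    }
  }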

On Nov 15, 2006, at 12:19 PM, Yoram Arnon wrote:

> I opt for displaying the size in bytes for now, since it's computed
> anyway, is readily available for free, and improves the UI.
> If/when we fix HADOOP-713 we can replace the computation of size with
> a better value for #files.
> Let's not prevent an improvement just because it might change in the
> future.
> Yoram
>
>> -----Original Message-----
>> From: Eric Baldeschwieler [mailto:eric14@yahoo-inc.com]
>> Sent: Tuesday, November 14, 2006 7:10 PM
>> To: hadoop-dev@lucene.apache.org
>> Subject: Re: [jira] Created: (HADOOP-713) dfs list operation
>> is too expensive
>>
>> So let's display nothing for now and revisit this once we have a
>> cleaner CRC story.
>>
>>
>> On Nov 14, 2006, at 10:55 AM, Hairong Kuang wrote:
>>
>>> Setting the size of a directory to be the # of files is a good
>>> idea. But the problem is that the dfs name node has no notion of
>>> checksum files, so the number of files it reports includes the
>>> checksum files. What's displayed at the client side has the
>>> checksum files filtered out, so the # of files would not match
>>> what's actually displayed at the client side.
>>>
>>> Hairong
>>>
>>> -----Original Message-----
>>> From: Arkady Borkovsky [mailto:arkady@yahoo-inc.com]
>>> Sent: Monday, November 13, 2006 5:07 PM
>>> To: hadoop-dev@lucene.apache.org
>>> Subject: Re: [jira] Created: (HADOOP-713) dfs list operation is too
>>> expensive
>>>
>>> When listing a directory, it may be more useful to display, for
>>> directory entries, the number of files in the directory rather
>>> than the number of bytes used by all the files in the directory
>>> and its subdirectories.
>>> This is a subjective opinion -- comments?
>>>
>>> (Currently, the value displayed for a subdirectory is "0")
>>>
>>> On Nov 13, 2006, at 3:25 PM, Hairong Kuang (JIRA) wrote:
>>>
>>>> dfs list operation is too expensive
>>>> -----------------------------------
>>>>
>>>>                  Key: HADOOP-713
>>>>                  URL: http://issues.apache.org/jira/browse/HADOOP-713
>>>>              Project: Hadoop
>>>>           Issue Type: Improvement
>>>>           Components: dfs
>>>>     Affects Versions: 0.8.0
>>>>             Reporter: Hairong Kuang
>>>>
>>>>
>>>> A list request to dfs returns an array of DFSFileInfo. A DFSFileInfo
>>>> of a directory contains a field called contentsLen, indicating its
>>>> size, which gets computed at the namenode side by recursively going
>>>> through its subdirs. At the same time, the whole dfs directory tree
>>>> is locked.
>>>>
>>>> The list operation is used a lot by DFSClient for listing a
>>>> directory, getting a file's size and # of replicas, and getting the
>>>> size of dfs. Only the last operation needs the field contentsLen to
>>>> be computed.
>>>>
>>>> To reduce its cost, we can add a flag to the list request:
>>>> contentsLen is computed only if the flag is set. By default, the
>>>> flag is false.
>>>>

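To make the checksum-file mismatch Hairong describes above concrete,
here is a rough sketch of the client-side filter (the naming
convention check is an assumption, not the actual client code):

  class ChecksumFilterSketch {
    // assumed convention: checksum files are "." + filename + ".crc"
    static boolean isChecksumFile(String name) {
      return name.startsWith(".") && name.endsWith(".crc");
    }

    // What the client actually shows: children minus checksum files.
    // A namenode that simply counted its children would report more,
    // so the two counts would not match.
    static int visibleFileCount(java.util.List<String> childNames) {
      int n = 0;
      for (String name : childNames) {
        if (!isChecksumFile(name)) n++;
      }
      return n;
    }
  }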

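And a minimal sketch of the flag-gated listing the JIRA description
above proposes, building on the INode sketch earlier in the thread
(method and field names are assumptions, not the actual patch):

  // When computeContentsLen is false (the proposed default), the
  // recursive walk is skipped and a listing stays proportional to
  // the number of direct entries.
  class DFSFileInfoSketch {
    boolean isDir;
    long contentsLen;   // -1 when the caller did not ask for it
  }

  class ListingSketch {
    // dir is assumed to be a directory (children != null).
    DFSFileInfoSketch[] getListing(INode dir, boolean computeContentsLen) {
      DFSFileInfoSketch[] results =
          new DFSFileInfoSketch[dir.children.size()];
      int i = 0;
      for (INode child : dir.children) {
        DFSFileInfoSketch info = new DFSFileInfoSketch();
        info.isDir = child.isDir();
        if (child.isDir()) {
          // expensive path, taken only for callers like dfs -du
          info.contentsLen = computeContentsLen
              ? child.computeContentsLen() : -1;
        } else {
          info.contentsLen = child.fileLength;
        }
        results[i++] = info;
      }
      return results;
    }
  }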