hadoop-common-dev mailing list archives

From "Yoram Arnon" <yar...@yahoo-inc.com>
Subject RE: [jira] Created: (HADOOP-713) dfs list operation is too expensive
Date Wed, 15 Nov 2006 22:06:13 GMT
I agree with all that, except that that's how the ls command works now,
performance issues and all, and that will change only when we fix
HADOOP-713. Until then, using that field is free - it's being computed
anyway.

That said, HADOOP-713 is not a current pain point. Users running ls is pretty
much a non-issue, since it's a rare operation, and it takes a fraction of a
second on the name node with our largish dfs. M-R jobs don't really pay a
penalty for this behaviour, since they normally execute on the last level of
the tree anyway, where the current behaviour is desirable. 
With all that in mind, the bug may stay in the queue for a while, until more
important issues are addressed.
Until then, we may as well get a better UI.

Yoram

> -----Original Message-----
> From: Eric Baldeschwieler [mailto:eric14@yahoo-inc.com] 
> Sent: Wednesday, November 15, 2006 1:11 PM
> To: hadoop-dev@lucene.apache.org
> Subject: Re: [jira] Created: (HADOOP-713) dfs list operation 
> is too expensive
> 
> It is not free.  As I understand it, we are recursively walking the  
> namespace tree with every ls to get this.
> 
> This is not a scalable design.  Even posix doesn't do this!
> 
> This is a performance problem that will only get worse.  I suggest  
> removing this performance mistake and documenting the existence of  
> dfs -du, which is a rather familiar solution to most users.
> 
> On Nov 15, 2006, at 12:19 PM, Yoram Arnon wrote:
> 
> > I opt for displaying the size in bytes for now, since it's computed
> > anyway, is readily available for free, and improves the UI.
> > If/when we fix HADOOP-713 we can replace the computation of size with a
> > better value for #files.
> > Let's not prevent an improvement just because it might change in the
> > future.
> > Yoram
> >
> >> -----Original Message-----
> >> From: Eric Baldeschwieler [mailto:eric14@yahoo-inc.com]
> >> Sent: Tuesday, November 14, 2006 7:10 PM
> >> To: hadoop-dev@lucene.apache.org
> >> Subject: Re: [jira] Created: (HADOOP-713) dfs list operation
> >> is too expensive
> >>
> >> So let's display nothing for now and revisit this once we have a
> >> cleaner CRC story.
> >>
> >>
> >> On Nov 14, 2006, at 10:55 AM, Hairong Kuang wrote:
> >>
> >>> Setting the size of a directory to be the # of files is a good idea.
> >>> But the problem is that the dfs name node has no idea of checksum
> >>> files. So the number of files includes that of checksum files. But
> >>> what's displayed at the client side has filtered out the checksum
> >>> files. So the # of files does not match what's really displayed at
> >>> the client side.
> >>>
> >>> Hairong
> >>>
> >>> -----Original Message-----
> >>> From: Arkady Borkovsky [mailto:arkady@yahoo-inc.com]
> >>> Sent: Monday, November 13, 2006 5:07 PM
> >>> To: hadoop-dev@lucene.apache.org
> >>> Subject: Re: [jira] Created: (HADOOP-713) dfs list operation is too
> >>> expensive
> >>>
> >>> When listing a directory, for directory entries it may be more useful
> >>> to display the number of files in a directory, rather than the number
> >>> of bytes used by all the files in the directory and its
> >>> subdirectories.
> >>> This is a subjective opinion -- comments?
> >>>
> >>> (Currently, the value displayed for a subdirectory is "0")
> >>>
> >>> On Nov 13, 2006, at 3:25 PM, Hairong Kuang (JIRA) wrote:
> >>>
> >>>> dfs list operation is too expensive
> >>>> -----------------------------------
> >>>>
> >>>>                  Key: HADOOP-713
> >>>>                  URL: http://issues.apache.org/jira/browse/HADOOP-713
> >>>>              Project: Hadoop
> >>>>           Issue Type: Improvement
> >>>>           Components: dfs
> >>>>     Affects Versions: 0.8.0
> >>>>             Reporter: Hairong Kuang
> >>>>
> >>>>
> >>>> A list request to dfs returns an array of DFSFileInfo. A DFSFileInfo
> >>>> of a directory contains a field called contentsLen, indicating its
> >>>> size, which gets computed at the namenode side by recursively going
> >>>> through its subdirs. At the same time, the whole dfs directory tree
> >>>> is locked.
> >>>>
> >>>> The list operation is used a lot by DFSClient for listing a
> >>>> directory, getting a file's size and # of replicas, and getting the
> >>>> size of dfs. Only the last operation needs the field contentsLen to
> >>>> be computed.
> >>>>
> >>>> To reduce its cost, we can add a flag to the list request.
> >>>> contentsLen is computed if the flag is set. By default, the flag is
> >>>> false.
> >>>>
> >>>> --
> >>>> This message is automatically generated by JIRA.
> >>>> -
> >>>> If you think it was sent incorrectly contact one of the
> >>>> administrators:
> >>>> http://issues.apache.org/jira/secure/Administrators.jspa
> >>>> -
> >>>> For more information on JIRA, see:
> >>>> http://www.atlassian.com/software/jira
> >>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >
> 
> 
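
For readers following the thread, the flag proposed in the HADOOP-713
description can be sketched roughly as below. This is only an illustration
under stated assumptions, not Hadoop's code: the classes Node and FileInfo,
the method list, and the parameter computeContentsLen are hypothetical names
made up for this sketch. Only the contentsLen field and the idea of a flag
on the list request (false by default) come from the thread. The counter
nodesVisited exists purely to show how much of the tree each call walks.

```java
import java.util.ArrayList;
import java.util.List;

public class ListFlagSketch {
    /** Hypothetical in-memory namespace node; not Hadoop's real class. */
    static final class Node {
        final String name;
        final boolean isDir;
        final long length; // file length in bytes; 0 for directories
        final List<Node> children = new ArrayList<>();

        Node(String name, boolean isDir, long length) {
            this.name = name;
            this.isDir = isDir;
            this.length = length;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Hypothetical stand-in for the DFSFileInfo entries a list request returns. */
    static final class FileInfo {
        final String name;
        final boolean isDir;
        final long contentsLen; // -1 means "not computed"

        FileInfo(String name, boolean isDir, long contentsLen) {
            this.name = name;
            this.isDir = isDir;
            this.contentsLen = contentsLen;
        }
    }

    /** Counts nodes touched by the recursive walk, for illustration only. */
    static int nodesVisited = 0;

    /** Recursively sums file lengths under a node (the expensive part). */
    static long contentsLen(Node node) {
        nodesVisited++;
        if (!node.isDir) {
            return node.length;
        }
        long total = 0;
        for (Node child : node.children) {
            total += contentsLen(child);
        }
        return total;
    }

    /** Lists a directory; walks subdirectories only when the flag is set. */
    static List<FileInfo> list(Node dir, boolean computeContentsLen) {
        List<FileInfo> result = new ArrayList<>();
        for (Node child : dir.children) {
            long len;
            if (!child.isDir) {
                len = child.length;       // a file's own length is already known
            } else if (computeContentsLen) {
                len = contentsLen(child); // walks the whole subtree
            } else {
                len = -1;                 // skipped: no recursion at all
            }
            result.add(new FileInfo(child.name, child.isDir, len));
        }
        return result;
    }

    /** Small made-up tree: /data/logs/{a.log,b.log}, /data/part-0, /README. */
    static Node sampleTree() {
        Node logs = new Node("logs", true, 0)
                .add(new Node("a.log", false, 100))
                .add(new Node("b.log", false, 250));
        Node data = new Node("data", true, 0)
                .add(logs)
                .add(new Node("part-0", false, 1000));
        return new Node("/", true, 0)
                .add(data)
                .add(new Node("README", false, 42));
    }

    public static void main(String[] args) {
        Node root = sampleTree();
        for (boolean flag : new boolean[] {false, true}) {
            nodesVisited = 0;
            List<FileInfo> entries = list(root, flag);
            System.out.println("computeContentsLen=" + flag
                    + ", subtree nodes visited=" + nodesVisited);
            for (FileInfo e : entries) {
                System.out.println("  " + e.name + (e.isDir ? "/" : "")
                        + " " + e.contentsLen);
            }
        }
    }
}
```

With the flag off, listing the root touches only its direct children, so the
cost no longer depends on how deep the tree below each directory is. With the
flag on, every directory entry triggers a full walk of its subtree, which is
the behaviour the issue describes as too expensive.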

