From Keith Wiley <kwi...@keithwiley.com>
Subject Re: Efficient query to directory-num-files?
Date Wed, 06 Oct 2010 05:56:42 GMT

On 2010, Oct 04, at 11:38 AM, Harsh J wrote:

> On Mon, Oct 4, 2010 at 11:11 PM, Keith Wiley <kwiley@keithwiley.com> wrote:
>> - I want to know how many files are in a directory.
>> - Well, actually, I want to know how many files are in a few  
>> thousand directories.
>> - I anticipate the answer to be approximately four million.
>> - If I were to pipe "hadoop fs -ls | wc" I estimate a return of about
>> 360 MB of textual ls data to my client (each hadoop ls entry is about
>> 90 bytes since it is always "ls -l" style), when all I really want is
>> the file count.
>> Is there a smarter way to do this?
>> Thanks.
> There's a "FileSystem.listStatus(...).length" you could use, in Java.
> (Cook up a utility for it if you need it on the command line. It's what
> the FsShell does anyway when you use it via 'hadoop fs/dfs'.)
> But I do not know if this will indeed reduce the querying time also,
> as it seems to create an array of all the entries under a path. I
> could not find a direct counting command, as even the count given by
> the FsShell seems to be of this manner. Trying it on some 50,000 items
> I created for testing it out seemed quick enough. I wouldn't know
> about 4 million though! Try it out and wait for better answers if any!
> :)

Thanks, I'll take a look at that and see what I can do with it.
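
For the record, here is a rough sketch of the kind of utility described above: sum the per-directory entry counts and print a single number, so no ls text ever reaches the client. The class name DirEntryCounter, the method totalEntries, and the pluggable "lister" are my own inventions for illustration, not anything from Hadoop. To keep it runnable without a cluster, main uses java.io.File as a stand-in; on HDFS the lister would presumably be built on FileSystem.listStatus(...).length, as the comment shows.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToLongFunction;

class DirEntryCounter {

    /**
     * Sums the entry counts of all given directories.
     * The lister maps a directory name to its number of entries; it is a
     * hypothetical hook so the same loop works for local disk or HDFS.
     */
    static long totalEntries(List<String> dirs, ToLongFunction<String> lister) {
        long total = 0;
        for (String dir : dirs) {
            total += lister.applyAsLong(dir);
        }
        return total;
    }

    public static void main(String[] args) {
        // Local-filesystem stand-in (assumption, for demonstration only).
        // With Hadoop on the classpath the lister would instead look like:
        //   dir -> fs.listStatus(new Path(dir)).length
        // where fs is an org.apache.hadoop.fs.FileSystem instance.
        ToLongFunction<String> localLister = dir -> {
            String[] names = new File(dir).list(); // null if not a directory
            return names == null ? 0 : names.length;
        };
        System.out.println(totalEntries(Arrays.asList(args), localLister));
    }
}
```

Note that this counts every entry in a listing, subdirectories included, so a files-only count would need a filter. It also still builds the full listing array per directory, which is the cost Harsh was unsure about. FileSystem.getContentSummary(path).getFileCount() might be worth timing as an alternative, since it returns only counts, but I have not measured either at the four-million scale.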


Keith Wiley     kwiley@keithwiley.com     keithwiley.com     

"What I primarily learned in grad school is how much I *don't* know.
Consequently, I left grad school with a higher ignorance to knowledge ratio
than when I entered."
                                            --  Keith Wiley

