hadoop-common-user mailing list archives

From Harsh J <qwertyman...@gmail.com>
Subject Re: Efficient query to directory-num-files?
Date Mon, 04 Oct 2010 18:38:23 GMT
On Mon, Oct 4, 2010 at 11:11 PM, Keith Wiley <kwiley@keithwiley.com> wrote:
> - I want to know how many files are in a directory.
> - Well, actually, I want to know how many files are in a few thousand directories.
> - I anticipate the answer to be approximately four million.
> - If I were to pipe "hadoop fs -ls | wc" I estimate a return of about 360 MB of textual
> ls data to my client (each Hadoop ls entry is about 90 bytes, since it is always "ls -l" style),
> when all I really want is the file count.
> Is there a smarter way to do this?
> Thanks.
> ________________________________________________________________________________
> Keith Wiley               kwiley@keithwiley.com               www.keithwiley.com
> "You can scratch an itch, but you can't itch a scratch. Furthermore, an itch can
> itch but a scratch can't scratch. Finally, a scratch can itch, but an itch can't
> scratch. All together this implies: He scratched the itch from the scratch that
> itched but would never itch the scratch from the itch that scratched."
>  -- Keith Wiley
> ________________________________________________________________________________

There's a "FileSystem.listStatus(...).length" you could use, in Java.

(Cook up a utility for it if you need it on the command line. It's what
the FsShell does anyway when you use it via 'hadoop fs/dfs'.)
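A minimal sketch of that approach (the class name is hypothetical, and this assumes the Hadoop FileSystem API and cluster configuration on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirFileCount {
    public static void main(String[] args) throws Exception {
        // Connect using whatever cluster settings are on the classpath.
        FileSystem fs = FileSystem.get(new Configuration());
        long total = 0;
        // Each argument is a directory whose immediate children we count.
        for (String dir : args) {
            FileStatus[] entries = fs.listStatus(new Path(dir));
            if (entries != null) {
                total += entries.length;
            }
        }
        // Only the count crosses back to the caller, not the listing text.
        System.out.println(total);
    }
}
```

Note that only the count is printed; the FileStatus array itself is still materialized client-side, which is the caveat below.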

But I do not know whether this will actually reduce the query time,
since it appears to build an array of all the entries under a path. I
could not find a direct counting command; even the count given by the
FsShell seems to work this way. Trying it on some 50,000 items I
created for testing, it seemed quick enough. I can't say how it fares
with 4 million, though! Try it out, and wait for better answers if any.
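For the command-line route, the FsShell count mentioned above can be invoked directly (the paths here are hypothetical, and availability of the command depends on your Hadoop release):

```shell
# Prints one summary line per path instead of streaming a full
# "ls -l" style listing back to the client.
# Output columns: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
hadoop fs -count /user/keith/dir1 /user/keith/dir2
```

This keeps the bytes-over-the-wire small, though server-side it may still enumerate the entries to produce the counts.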

Harsh J
