hadoop-common-dev mailing list archives

From "Marco Nicosia (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-4412) hadoop dfs -find feature
Date Tue, 14 Oct 2008 17:47:44 GMT
hadoop dfs -find feature
------------------------

                 Key: HADOOP-4412
                 URL: https://issues.apache.org/jira/browse/HADOOP-4412
             Project: Hadoop Core
          Issue Type: New Feature
          Components: dfs
            Reporter: Marco Nicosia


Both sysadmins and users make frequent use of the Unix 'find' command, but Hadoop has no equivalent.
Without one, users write scripts that lean heavily on hadoop dfs -lsr and implement one-off
find replacements. I think hadoop dfs -lsr is somewhat taxing on the NameNode and a really slow
experience on the client side. An in-NameNode find operation might be only slightly more taxing
on the NameNode, but significantly faster from the client's point of view.
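
For illustration, the kind of client-side one-off described above might look like the following Python sketch. It assumes each listing line has eight whitespace-separated columns (permissions, replication, owner, group, size, date, time, path), which may not match the output of every Hadoop release. The names parse_lsr_line and find_by_name are hypothetical and are not part of Hadoop.

```python
import fnmatch
from typing import Iterable, Iterator, NamedTuple, Optional


class Entry(NamedTuple):
    is_dir: bool
    owner: str
    group: str
    size: int
    path: str


def parse_lsr_line(line: str) -> Optional[Entry]:
    """Parse one listing line; return None for lines that do not fit.

    ASSUMPTION: the columns are
    permissions, replication, owner, group, size, date, time, path.
    """
    # Split at most 7 times so that a path containing spaces stays whole.
    parts = line.split(None, 7)
    if len(parts) != 8 or not parts[4].isdigit():
        return None  # e.g. a "Found N items" header line
    perms, _repl, owner, group, size, _date, _time, path = parts
    return Entry(perms.startswith("d"), owner, group, int(size), path.rstrip("\n"))


def find_by_name(lines: Iterable[str], pattern: str) -> Iterator[Entry]:
    """Yield entries whose final path component matches a shell pattern."""
    for line in lines:
        entry = parse_lsr_line(line)
        if entry is None:
            continue
        basename = entry.path.rsplit("/", 1)[-1]
        if fnmatch.fnmatchcase(basename, pattern):
            yield entry
```

Note that every line of the recursive listing still has to travel to the client before any filtering happens, which is the cost this issue is about.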

The minimum set of options I can think of which would make a Hadoop find command generally
useful is (in priority order):
* -type (file or directory, for now)
* -atime/-ctime/-mtime (... and -creationtime?) (both + and - arguments)
* -print0 (for piping to xargs -0)
* -depth
* -owner/-group (and -nouser/-nogroup)
* -name (allowing for shell pattern, or even regex?)
* -perm
* -size
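
As a rough sketch of what the "+ and -" arguments and -print0 imply, the Python below follows the usual find conventions: +n means greater than n, -n means less than n, and a bare n means exactly n, with file ages counted in whole 24-hour periods. The function names are hypothetical, and nothing here reflects an actual Hadoop implementation.

```python
from typing import Iterable


def matches_numeric(arg: str, value: int) -> bool:
    """Interpret a find-style numeric argument against a measured value.

    '+n' is true when value > n, '-n' when value < n, 'n' when value == n.
    """
    if arg.startswith("+"):
        return value > int(arg[1:])
    if arg.startswith("-"):
        return value < int(arg[1:])
    return value == int(arg)


def age_in_days(now: float, timestamp: float) -> int:
    """Whole 24-hour periods between timestamp and now, remainder discarded,
    which is how find measures -atime/-ctime/-mtime."""
    return int((now - timestamp) // 86400)


def print0(paths: Iterable[str]) -> bytes:
    """Terminate each path with a NUL byte, the format `xargs -0` expects."""
    return b"".join(p.encode("utf-8") + b"\0" for p in paths)
```

With these helpers, a test such as -mtime +7 would reduce to matches_numeric("+7", age_in_days(now, mtime)).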

One possible special case, which could be really cool if it ran from within the NameNode:
* -delete
The "hadoop dfs -lsr | hadoop dfs -rm" cycle is really, really slow.

Lower priority: some people do use operators, mostly to run -or searches such as:
* find / \( -nouser -or -nogroup \)
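
One way to picture operator support is as composable predicates, sketched below in Python. How HDFS would decide that an owner or group is "unknown" is not specified here, so the sketch simply takes explicit sets of known names. All function names are hypothetical.

```python
from typing import Callable, Dict, Set

# An entry is modeled as a plain dict with "owner" and "group" keys (assumption).
Predicate = Callable[[Dict[str, str]], bool]


def nouser(known_users: Set[str]) -> Predicate:
    """True for entries whose owner is not in the known set."""
    return lambda entry: entry["owner"] not in known_users


def nogroup(known_groups: Set[str]) -> Predicate:
    """True for entries whose group is not in the known set."""
    return lambda entry: entry["group"] not in known_groups


def any_of(*predicates: Predicate) -> Predicate:
    """Combine predicates with -or semantics: true if any one matches."""
    return lambda entry: any(p(entry) for p in predicates)


# Rough equivalent of: \( -nouser -or -nogroup \)
orphaned = any_of(nouser({"alice"}), nogroup({"staff"}))
```

An -and operator would be the same shape with all() in place of any().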

Finally, here is a link to the [POSIX spec for find|http://www.opengroup.org/onlinepubs/009695399/utilities/find.html].


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

