hadoop-common-issues mailing list archives

From "wolfgang hoschek (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-8989) hadoop dfs -find feature
Date Wed, 28 Nov 2012 01:01:25 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-8989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13505130#comment-13505130
] 

wolfgang hoschek commented on HADOOP-8989:
------------------------------------------

In addition, it would be good to have a -starttime <dateTime> <dateTimePattern> option
so that the time-related options (e.g. -mmin, -mtime, -amin, -atime) work relative to
a specific absolute timestamp (say, "give me all files modified since yesterday midnight") instead
of relative to whatever "now" happens to be when the command is executed. The dateTimePattern
could default to ISO 8601 and accept any java.text.SimpleDateFormat format pattern.


The FindOptions class already seems to foresee such usage. It's just a matter of exposing
it at the command level.
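
To make the proposal concrete, here is a minimal sketch of how a -starttime argument could be parsed and used as the reference point instead of "now". Everything in it is an assumption for illustration: the class StartTimeSketch, its method names, the default pattern, and the choice of UTC are hypothetical and are not taken from the attached patch or from the FindOptions class.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Hadoop or of the HADOOP-8989 patch.
final class StartTimeSketch {
    // Assumed default: an ISO 8601-style local date-time without a zone offset.
    static final String DEFAULT_PATTERN = "yyyy-MM-dd'T'HH:mm:ss";

    // Parses <dateTime> with <dateTimePattern> (or the assumed default when
    // the pattern is null) and returns epoch milliseconds, interpreted as UTC.
    static long parseStartTime(String dateTime, String pattern) {
        String effective = (pattern == null) ? DEFAULT_PATTERN : pattern;
        SimpleDateFormat format = new SimpleDateFormat(effective, Locale.ROOT);
        format.setLenient(false); // reject out-of-range fields such as month 13
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return format.parse(dateTime).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException(
                "Cannot parse '" + dateTime + "' with pattern '" + effective + "'", e);
        }
    }

    // "Modified since the start time": the file's mtime is at or after it.
    static boolean modifiedSince(long mtimeMillis, long startTimeMillis) {
        return mtimeMillis >= startTimeMillis;
    }

    // Age of a file in whole minutes measured against the reference timestamp
    // rather than the wall clock; negative if the file is newer than the reference.
    static long ageInMinutes(long mtimeMillis, long referenceMillis) {
        return TimeUnit.MILLISECONDS.toMinutes(referenceMillis - mtimeMillis);
    }

    public static void main(String[] args) {
        long start = parseStartTime("2012-11-27T00:00:00", null);
        long mtime = parseStartTime("2012-11-27 22:30", "yyyy-MM-dd HH:mm");
        System.out.println(modifiedSince(mtime, start));
        System.out.println(ageInMinutes(mtime, start));
    }
}
```

With a fixed reference timestamp, repeated runs of the same command select the same files regardless of when they are executed, which is the point of the option.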
                
> hadoop dfs -find feature
> ------------------------
>
>                 Key: HADOOP-8989
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8989
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Marco Nicosia
>            Assignee: Jonathan Allen
>         Attachments: HADOOP-8989.patch, HADOOP-8989.patch, HADOOP-8989.patch
>
>
> Both sysadmins and users make frequent use of the unix 'find' command, but Hadoop has
no equivalent. Without one, users write scripts that make heavy use of hadoop dfs -lsr
and implement one-off find replacements. I think hdfs -lsr is somewhat taxing on the NameNode, and
a really slow experience on the client side. Possibly an in-NameNode find operation would
be only a bit more taxing on the NameNode, but significantly faster from the client's point
of view?
> The minimum set of options I can think of which would make a Hadoop find command generally
useful is (in priority order):
> * -type (file or directory, for now)
> * -atime/-ctime/-mtime (... and -creationtime?) (both + and - arguments)
> * -print0 (for piping to xargs -0)
> * -depth
> * -owner/-group (and -nouser/-nogroup)
> * -name (allowing for shell pattern, or even regex?)
> * -perm
> * -size
> One possible special case, which could be really useful if it ran from within the
NameNode:
> * -delete
> The "hadoop dfs -lsr | hadoop dfs -rm" cycle is really, really slow.
> Lower priority, some people do use operators, mostly to execute -or searches such as:
> * find / \( -nouser -or -nogroup \)
> Finally, I thought I'd include a link to the [Posix spec for find|http://www.opengroup.org/onlinepubs/009695399/utilities/find.html]
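
As an illustration of how the operator support mentioned above might be modeled, the following sketch composes find-style tests with java.util.function.Predicate. The Entry class, the predicate names, and the use of null to model an owner or group that does not resolve are all hypothetical; none of this reflects the actual patch.

```java
import java.util.function.Predicate;

// Hypothetical model of find expressions, not Hadoop code.
final class FindExpressionSketch {
    // Stand-in for the metadata a find expression would inspect.
    static final class Entry {
        final String owner;      // null models an owner with no known user
        final String group;      // null models a group with no known group name
        final boolean directory;

        Entry(String owner, String group, boolean directory) {
            this.owner = owner;
            this.group = group;
            this.directory = directory;
        }
    }

    static final Predicate<Entry> NO_USER = e -> e.owner == null;   // -nouser
    static final Predicate<Entry> NO_GROUP = e -> e.group == null;  // -nogroup
    static final Predicate<Entry> IS_FILE = e -> !e.directory;      // -type f

    // Models: \( -nouser -or -nogroup \) -type f
    // The grouped -or is evaluated first, then combined with the implicit -and.
    static final Predicate<Entry> ORPHANED_FILE = NO_USER.or(NO_GROUP).and(IS_FILE);

    public static void main(String[] args) {
        Entry orphan = new Entry(null, "hadoop", false);
        Entry owned = new Entry("marco", "hadoop", false);
        System.out.println(ORPHANED_FILE.test(orphan));
        System.out.println(ORPHANED_FILE.test(owned));
    }
}
```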

