hadoop-common-issues mailing list archives

From "Jonathan Allen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-8989) hadoop dfs -find feature
Date Wed, 18 Jun 2014 20:16:27 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-8989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14036295#comment-14036295 ]

Jonathan Allen commented on HADOOP-8989:
----------------------------------------

[~daryn]

bq. Do you actually need the changes to Command and CommandFactory to add a new command?

Yes, I think I do.

The change to Command#processPaths introduces a hook that allows recursion through something
other than directories; specifically, it lets Find recurse through symbolic links when
required. The same behaviour could be implemented by overriding the whole processPaths method
in Find, but that seems likely to introduce maintenance problems going forward (and in a [previous comment|https://issues.apache.org/jira/browse/HADOOP-8989?focusedCommentId=13585968&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13585968]
you suggested that this shouldn't be done).
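
To illustrate the shape of such a hook, here is a minimal, self-contained sketch. It is not the code from the patch: Node, BaseCommand, FindCommand and the method name isPathRecursable are stand-ins invented for this example, and a symlink is modelled simply as a node whose children are the contents of its target. The point is only that the traversal loop stays in the base class while a subclass changes a single decision.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a filesystem entry (not Hadoop's PathData). */
class Node {
    final String name;
    final boolean isDirectory;
    final boolean isSymlink;
    // For a symlink, children model the contents of the link target.
    final List<Node> children = new ArrayList<>();

    Node(String name, boolean isDirectory, boolean isSymlink) {
        this.name = name;
        this.isDirectory = isDirectory;
        this.isSymlink = isSymlink;
    }

    /** Adds a child and returns this node so calls can be chained. */
    Node add(Node child) {
        children.add(child);
        return this;
    }
}

/** Hypothetical base command: the traversal loop is written once, here. */
class BaseCommand {
    final List<String> visited = new ArrayList<>();

    /** The hook: decides whether the walker descends into an item. */
    protected boolean isPathRecursable(Node item) {
        return item.isDirectory;
    }

    /** Per-item work; this sketch just records the name. */
    protected void processPath(Node item) {
        visited.add(item.name);
    }

    /** Visits each item, then descends if the hook allows it. */
    void processPaths(List<Node> items) {
        for (Node item : items) {
            processPath(item);
            if (isPathRecursable(item)) {
                processPaths(item.children);
            }
        }
    }
}

/** Hypothetical find command: overrides only the hook, not the loop. */
class FindCommand extends BaseCommand {
    private final boolean followLinks;

    FindCommand(boolean followLinks) {
        this.followLinks = followLinks;
    }

    @Override
    protected boolean isPathRecursable(Node item) {
        return item.isDirectory || (followLinks && item.isSymlink);
    }
}
```

With a tree in which a directory holds a file and a symlink, BaseCommand stops at the link while FindCommand constructed with followLinks set to true also visits what the link points to. The sketch does nothing about link cycles, which a real implementation would have to guard against.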

The CommandFactory is added into the Command so that it can be used by the -exec expression
(it needs to know what class to pass the found files into). It seemed cleaner to do it this
way than to call FsShell.run for each file.
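
As a rough illustration of that design, the sketch below shows an expression that holds a factory reference, looks a command up by name and runs it in-process for each found path. Everything in it is hypothetical and simplified: ShellCommand, SimpleCommandFactory and ExecExpression are names made up for this example and do not reflect the real CommandFactory or FsShell signatures.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical command interface: returns 0 on success. */
interface ShellCommand {
    int run(String path);
}

/** Hypothetical registry that maps a command name to a way of creating it. */
class SimpleCommandFactory {
    private final Map<String, Supplier<ShellCommand>> registry = new HashMap<>();

    void register(String name, Supplier<ShellCommand> maker) {
        registry.put(name, maker);
    }

    /** Returns a fresh command instance, or null if the name is unknown. */
    ShellCommand getInstance(String name) {
        Supplier<ShellCommand> maker = registry.get(name);
        return maker == null ? null : maker.get();
    }
}

/** Hypothetical -exec expression: resolves its command through the factory. */
class ExecExpression {
    private final SimpleCommandFactory factory;
    private final String commandName;

    ExecExpression(SimpleCommandFactory factory, String commandName) {
        this.factory = factory;
        this.commandName = commandName;
    }

    /** Runs the named command on one found path; true if it exited with 0. */
    boolean apply(String foundPath) {
        ShellCommand cmd = factory.getInstance(commandName);
        if (cmd == null) {
            throw new IllegalArgumentException("Unknown command: " + commandName);
        }
        return cmd.run(foundPath) == 0;
    }
}
```

Under these assumptions the expression never re-enters the shell's argument parsing for each file; it only needs the factory it was handed.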

Both seem to be fairly simple, safe changes that could be useful to other commands in the
future.

bq. I'm not sure the path handling/expansion is consistent with the rest of the commands.
Find is a bit odd in that paths come before all options, but I think there has to be a cleaner
way to implement the parsing.

Possibly, but I couldn't think of one. If you can see something I've missed then feel free
to suggest it.

> hadoop dfs -find feature
> ------------------------
>
>                 Key: HADOOP-8989
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8989
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Marco Nicosia
>            Assignee: Jonathan Allen
>         Attachments: HADOOP-8989.patch, HADOOP-8989.patch, HADOOP-8989.patch, HADOOP-8989.patch,
> HADOOP-8989.patch, HADOOP-8989.patch, HADOOP-8989.patch, HADOOP-8989.patch, HADOOP-8989.patch,
> HADOOP-8989.patch, HADOOP-8989.patch, HADOOP-8989.patch, HADOOP-8989.patch, HADOOP-8989.patch,
> HADOOP-8989.patch, HADOOP-8989.patch
>
>
> Both sysadmins and users make frequent use of the unix 'find' command, but Hadoop has
> no correlate. Without this, users are writing scripts which make heavy use of hadoop dfs -lsr,
> and implementing find one-offs. I think hdfs -lsr is somewhat taxing on the NameNode, and
> a really slow experience on the client side. Possibly an in-NameNode find operation would
> be only a bit more taxing on the NameNode, but significantly faster from the client's point
> of view?
> The minimum set of options I can think of which would make a Hadoop find command generally
> useful is (in priority order):
> * -type (file or directory, for now)
> * -atime/-ctime/-mtime (... and -creationtime?) (both + and - arguments)
> * -print0 (for piping to xargs -0)
> * -depth
> * -owner/-group (and -nouser/-nogroup)
> * -name (allowing for shell pattern, or even regex?)
> * -perm
> * -size
> One possible special case, which could be really cool if it ran from within the
> NameNode:
> * -delete
> The "hadoop dfs -lsr | hadoop dfs -rm" cycle is really, really slow.
> Lower priority: some people do use operators, mostly to execute -or searches such as:
> * find / \( -nouser -or -nogroup \)
> Finally, I thought I'd include a link to the [Posix spec for find|http://www.opengroup.org/onlinepubs/009695399/utilities/find.html]



--
This message was sent by Atlassian JIRA
(v6.2#6252)
