Return-Path:
X-Original-To: apmail-hadoop-common-issues-archive@minotaur.apache.org
Delivered-To: apmail-hadoop-common-issues-archive@minotaur.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
	by minotaur.apache.org (Postfix) with SMTP id DCB04DE8E
	for ; Fri, 9 Nov 2012 19:08:14 +0000 (UTC)
Received: (qmail 19131 invoked by uid 500); 9 Nov 2012 19:08:14 -0000
Delivered-To: apmail-hadoop-common-issues-archive@hadoop.apache.org
Received: (qmail 19089 invoked by uid 500); 9 Nov 2012 19:08:14 -0000
Mailing-List: contact common-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: common-issues@hadoop.apache.org
Delivered-To: mailing list common-issues@hadoop.apache.org
Received: (qmail 19080 invoked by uid 99); 9 Nov 2012 19:08:14 -0000
Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28)
	by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 09 Nov 2012 19:08:14 +0000
Date: Fri, 9 Nov 2012 19:08:14 +0000 (UTC)
From: "Jonathan Allen (JIRA)"
To: common-issues@hadoop.apache.org
Message-ID: <1910679763.93318.1352488094545.JavaMail.jiratomcat@arcas>
Subject: [jira] [Updated] (HADOOP-8989) hadoop dfs -find feature
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394

     [ https://issues.apache.org/jira/browse/HADOOP-8989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Allen updated HADOOP-8989:
-----------------------------------

    Attachment: HADOOP-8989.patch

Not complete but what's there is stable and could be reviewed if somebody wants to look at it. I'll just be adding expressions and tidying up the documentation now.

The following expressions are implemented as per the posix definition: -a, -atime, -depth, -group, -mtime, -name, -newer, -nogroup, -not, -nouser, -o, -perm, -print, -prune, -size, -type, -user.
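
For illustration only, the way the operators -a, -o and -not combine primary expressions such as -name, -type and -user can be sketched in a few lines of Python. This is not taken from the attached patch and says nothing about how the patch is structured; Entry, name, ftype, user, and_, or_ and not_ are hypothetical names, and the entries are made-up sample data.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import Callable


@dataclass
class Entry:
    """Hypothetical stand-in for a file status record (not a Hadoop class)."""
    path: str
    is_dir: bool
    owner: str


# An expression is any function that accepts an Entry and returns True/False.
Expr = Callable[[Entry], bool]


def name(pattern: str) -> Expr:
    # -name: match the last path component against a shell pattern.
    return lambda e: fnmatchcase(e.path.rstrip("/").rsplit("/", 1)[-1], pattern)


def ftype(kind: str) -> Expr:
    # -type: "d" selects directories, anything else selects files (assumption).
    return lambda e: e.is_dir == (kind == "d")


def user(owner: str) -> Expr:
    # -user: match the owner of the entry.
    return lambda e: e.owner == owner


def and_(a: Expr, b: Expr) -> Expr:
    # -a: true only if both operands are true (b is skipped when a is false).
    return lambda e: a(e) and b(e)


def or_(a: Expr, b: Expr) -> Expr:
    # -o: true if either operand is true (b is skipped when a is true).
    return lambda e: a(e) or b(e)


def not_(a: Expr) -> Expr:
    # -not: negate the operand.
    return lambda e: not a(e)


# Made-up sample data.
entries = [
    Entry("/logs", True, "hdfs"),
    Entry("/logs/app.log", False, "alice"),
    Entry("/logs/app.log.gz", False, "bob"),
    Entry("/tmp/data.csv", False, "alice"),
]

# In spirit: find / -type f -a \( -name '*.log' -o -not -user alice \)
expr = and_(ftype("f"), or_(name("*.log"), not_(user("alice"))))
matches = [e.path for e in entries if expr(e)]
print(matches)
```

Modelling each expression as a predicate keeps the operators independent of the primaries, so adding a new primary does not touch -a, -o or -not.
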
I haven't included the following posix expressions as they don't look applicable here: -xdev, -links, -ctime.

I've still got to add -exec and any other non-posix extensions that look useful.

> hadoop dfs -find feature
> ------------------------
>
>                 Key: HADOOP-8989
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8989
>             Project: Hadoop Common
>          Issue Type: New Feature
>            Reporter: Marco Nicosia
>            Assignee: Jonathan Allen
>         Attachments: HADOOP-8989.patch, HADOOP-8989.patch
>
>
> Both sysadmins and users make frequent use of the unix 'find' command, but Hadoop has no correlate. Without this, users are writing scripts which make heavy use of hadoop dfs -lsr, and implementing find one-offs. I think hdfs -lsr is somewhat taxing on the NameNode, and a really slow experience on the client side. Possibly an in-NameNode find operation would be only a bit more taxing on the NameNode, but significantly faster from the client's point of view?
> The minimum set of options I can think of which would make a Hadoop find command generally useful is (in priority order):
> * -type (file or directory, for now)
> * -atime/-ctime/-mtime (... and -creationtime?) (both + and - arguments)
> * -print0 (for piping to xargs -0)
> * -depth
> * -owner/-group (and -nouser/-nogroup)
> * -name (allowing for shell pattern, or even regex?)
> * -perm
> * -size
> One possible special case, but could possibly be really cool if it ran from within the NameNode:
> * -delete
> The "hadoop dfs -lsr | hadoop dfs -rm" cycle is really, really slow.
> Lower priority, some people do use operators, mostly to execute -or searches such as:
> * find / \( -nouser -or -nogroup \)
> Finally, I thought I'd include a link to the [Posix spec for find|http://www.opengroup.org/onlinepubs/009695399/utilities/find.html]

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira