hadoop-common-dev mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3173) inconsistent globbing support for dfs commands
Date Sat, 17 May 2008 01:01:57 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12597672#action_12597672 ]

Chris Douglas commented on HADOOP-3173:
---------------------------------------

I mostly agree with Hairong; this is easy to do programmatically, and while there are a few
alternatives (different escape character, URI encoding, new "literal" FsShell commands, etc),
most appear to make the general case worse to accommodate a fairly esoteric use.
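To illustrate the "easy to do programmatically" point: a caller that wants to treat a name literally can backslash-escape the glob metacharacters before handing the string to a glob-aware API. This is a minimal sketch under that assumption; the helper name and the exact metacharacter set are illustrative, not part of the Hadoop API.

```java
// Hypothetical helper, not Hadoop API: escape the characters that a
// glob matcher treats specially so the path is matched literally.
public class GlobEscape {
    private static final String GLOB_CHARS = "*?[]{}\\";

    public static String escape(String path) {
        StringBuilder sb = new StringBuilder(path.length());
        for (char c : path.toCharArray()) {
            if (GLOB_CHARS.indexOf(c) >= 0) {
                sb.append('\\');  // prefix each metacharacter
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints /user/rajive/a/\*
        System.out.println(escape("/user/rajive/a/*"));
    }
}
```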

On the other hand, there are only a few places (FsShell and FileInputFormat, mainly) where
we call globStatus, and in each case a String is converted to a Path before being converted
back into a String in globStatus. Without the conversion, the pattern syntax can mandate that
the path separator must be '/' independent of the Path syntax. Unfortunately, actually effecting
this change is awkward, primarily because one must still create a Path of the glob string
to obtain the FileSystem to resolve it against. If the glob string creates a Path to be resolved
against a FileSystem other than the default, then the scheme, authority, etc. must be excised
from the original string to preserve the escaping, etc., which will ultimately duplicate much
of the URI parsing that's already happening in Path. Particularly for FileInputFormat and
its users, pulling out all the Path dependencies (i.e. changing users of the globbing API)
is a huge job with a modest payback.

Since Path(String) already isolates this segment, we could introduce Path::getRawPath that
would preserve the path before Path::normalizePath and URI::normalize. With this, globStatus
would resolve Path::getRawPath instead of p.toUri().getPath(). Unfortunately, this would mean
that globStatus(p) might return different results from globStatus(new Path(p.toString())),
which means FileInputFormat would still have this issue. Even if Path(Path, String) and variants
preserved a raw path, its semantics would be unclear. In Path(Path, String), is the raw Path
equal to the raw Path from the second arg only if it is absolute? Is it the raw path from the first
arg preserved in some way? We could just assert that the raw path is only different from p.toUri().getPath()
if it was created with Path(String), but this could be confusing when creating globs from
a base path (i.e. Path(Path, String) or possibly more confusing, Path(String, Path)). The
URI normalization also removes all the ".." and "." entries in the Path, which the regexp
would then have to handle (e.g. "a/b/../c*" is resolved to "a/c*" now, but using the raw path,
GlobFilter would accept "a/b/dd/c" since '.' matches GlobFilter::PAT_ANY). That said, FileInputFormats
and all Strings that were once Paths wouldn't have to deal with this, while utilities like
FsShell could match "a/b/../c" as regexps, which might not be a bad thing.
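Both effects above can be reproduced with stock JDK classes; this is just a demonstration of the two behaviors described, not Hadoop code, and it assumes the glob-to-regex translation leaves '.' unescaped as described for GlobFilter.

```java
import java.net.URI;

public class RawPathDemo {
    public static void main(String[] args) {
        // URI normalization (applied when a Path is built from a String)
        // collapses ".." segments before any glob code sees the string.
        String normalized = URI.create("a/b/../c*").normalize().getPath();
        System.out.println(normalized);  // a/c*

        // If the raw, un-normalized string were instead handed to a
        // regex-based matcher that does not escape '.', each '.' matches
        // any single character, so the pattern "a/b/../c" accepts
        // "a/b/dd/c".
        System.out.println("a/b/dd/c".matches("a/b/../c"));  // true
    }
}
```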

If we want to fix this, I'd propose adding Path::getRawPath which would be used in FileSystem::globStatus,
but could only be different from p.getUri().getPath() when the Path was created from a String.
This covers cases where one wants to create a Path regexp manually and use it as a glob (as
in FsShell), but should not change behavior elsewhere.
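A minimal sketch of the proposed semantics, assuming raw-path retention only in the String constructor; this is a standalone toy class, not the real org.apache.hadoop.fs.Path, and the constructor and field names are illustrative.

```java
// Toy model of the proposal: the un-normalized string survives only
// when the object is built directly from a String.
public class RawPath {
    private final String normalized;  // path after URI normalization
    private final String raw;         // set only by the String constructor

    public RawPath(String pathString) {
        this.normalized =
            java.net.URI.create(pathString).normalize().getPath();
        this.raw = pathString;
    }

    public RawPath(RawPath parent, String child) {
        this.normalized = java.net.URI
            .create(parent.normalized + "/" + child).normalize().getPath();
        this.raw = null;  // raw form is deliberately dropped here
    }

    /** Un-normalized string if available, else the normalized path. */
    public String getRawPath() {
        return raw != null ? raw : normalized;
    }
}
```

With this shape, globStatus could resolve getRawPath() and still behave identically for every Path not built from a String, matching the "only different when created with Path(String)" rule above.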

Thoughts?

> inconsistent globbing support for dfs commands
> ----------------------------------------------
>
>                 Key: HADOOP-3173
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3173
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>         Environment: Hadoop 0.16.1
>            Reporter: Rajiv Chittajallu
>             Fix For: 0.18.0
>
>
> hadoop dfs -mkdir /user/*/bar creates a directory "/user/*/bar" and you can't delete
> /user/* as -rmr expands the glob
> $ hadoop dfs -mkdir /user/rajive/a/*/foo
> $ hadoop dfs -ls /user/rajive/a
> Found 4 items
> /user/rajive/a/*	<dir>		2008-04-04 16:09	rwx------	rajive	users
> /user/rajive/a/b	<dir>		2008-04-04 16:08	rwx------	rajive	users
> /user/rajive/a/c	<dir>		2008-04-04 16:08	rwx------	rajive	users
> /user/rajive/a/d	<dir>		2008-04-04 16:08	rwx------	rajive	users
> $ hadoop dfs -ls /user/rajive/a/*
> /user/rajive/a/*/foo	<dir>		2008-04-04 16:09	rwx------	rajive	users
> $ hadoop dfs -rmr /user/rajive/a/*
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/*
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/b
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/c
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/d
> I am not able to escape '*' from being expanded.
> $ hadoop dfs -rmr '/user/rajive/a/*'
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/*
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/b
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/c
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/d
> $ hadoop dfs -rmr  '/user/rajive/a/\*'
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/*
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/b
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/c
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/d
> $ hadoop dfs -rmr  /user/rajive/a/\* 
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/*
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/b
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/c
> Moved to trash: hdfs://namenode-1:8020/user/rajive/a/d

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

