hadoop-common-dev mailing list archives

From "Sameer Paranjpye (JIRA)" <j...@apache.org>
Subject [jira] Created: (HADOOP-495) distcp deficiencies and bugs
Date Wed, 30 Aug 2006 16:42:24 GMT
distcp deficiencies and bugs
----------------------------

                 Key: HADOOP-495
                 URL: http://issues.apache.org/jira/browse/HADOOP-495
             Project: Hadoop
          Issue Type: Bug
          Components: dfs
    Affects Versions: 0.5.0
            Reporter: Sameer Paranjpye
         Assigned To: Arun C Murthy
             Fix For: 0.6.0


distcp as currently implemented has several deficiencies and bugs, which I encountered while
trying to use it to import logs from HTTP servers into my local DFS cluster. In general, it
is unfriendly to users and its error reporting is hard to understand.
Here's a list of things that can be improved:

1) There isn't a man page that explains the various command line options. We should have one.
2) Malformed URLs cause a NullPointerException to be thrown, with no error message stating
what went wrong.
3) Relative paths on the local filesystem are not handled at all.
4) The scheme used for HDFS URLs is dfs://; it ought to be hdfs://, since 'dfs' is far too
general an acronym to use in URLs.
5) If a copy to the local filesystem is specified with a relative path, for instance
    ./bin/hadoop distcp dfs://localhost:8020/foo.txt foo.txt
then the job runs successfully but the file is nowhere to be seen. It looks like the file
gets copied to the map/reduce job's current working directory.
6) If a copy to a DFS is specified and the namenode cannot be resolved, the job fails with
an IOException; no comprehensible error message is printed.
7) If an HTTP URI has a query component, it is disregarded when constructing the destination
file name. For instance, if one specifies the following two URLs to be copied in a file list
  http://myhost.mydomain.com/files.cgi?n=/logs/foo.txt
  http://myhost.mydomain.com/files.cgi?n=/logs/bar.txt

a single file called 'files.cgi' is created and is overwritten by one or both source files;
it's not clear which. The destination path name should be constructed the way 'wget' does it,
using the filename plus the query part of the URL, escaping characters as necessary.
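The wget-style naming suggested above could be sketched roughly as follows. This is a hypothetical helper, not distcp's actual code; it assumes Java 10+ for the Charset overload of URLEncoder.encode:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class DestName {
    // Sketch of wget-style destination naming (hypothetical helper): keep the
    // last path segment plus the query string, percent-escaping the query so
    // that characters like '/' cannot collide with directory separators on
    // the destination filesystem.
    static String destName(URI src) {
        String path = src.getPath();
        String base = path.substring(path.lastIndexOf('/') + 1);
        String query = src.getQuery();
        if (query == null) {
            return base;                    // no query: plain file name
        }
        return base + "?" + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(destName(
            URI.create("http://myhost.mydomain.com/files.cgi?n=/logs/foo.txt")));
        // prints files.cgi?n%3D%2Flogs%2Ffoo.txt
        System.out.println(destName(
            URI.create("http://myhost.mydomain.com/files.cgi?n=/logs/bar.txt")));
        // prints files.cgi?n%3D%2Flogs%2Fbar.txt -- distinct names, no clobbering
    }
}
```

With the query folded into the name, the two example URLs above map to distinct destination files instead of both landing on 'files.cgi'.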

8) It looks like, if a list of URLs is specified in a file, distcp runs a separate map/reduce
job for each entry in the file. Why? It seems one could do a straight copy for local files,
since that task needs to run locally, followed by a single map/reduce job that copies the
HDFS and HTTP URLs.
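The error-reporting problems in points 2 and 6 could be addressed by validating each source URI up front, before any job is launched. A minimal sketch, with hypothetical method names that are not distcp's actual API:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriCheck {
    // Hypothetical up-front validation: reject a malformed or scheme-less
    // source URL with a readable message instead of letting a
    // NullPointerException or bare IOException surface deep inside the copy job.
    static URI parseSource(String s) {
        URI uri;
        try {
            uri = new URI(s);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(
                "Malformed source URL: " + s + " (" + e.getMessage() + ")");
        }
        if (uri.getScheme() == null) {
            throw new IllegalArgumentException(
                "Source URL has no scheme (expected hdfs://, http://, or file://): " + s);
        }
        return uri;
    }

    public static void main(String[] args) {
        System.out.println(parseSource("hdfs://localhost:8020/foo.txt"));
        try {
            parseSource("foo.txt");  // relative path, no scheme
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A similar check could resolve the namenode's hostname at submit time, so a bad host fails immediately with the offending URL in the message rather than with an opaque IOException mid-job.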



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
