hadoop-common-user mailing list archives

From Derek Young <dyo...@kayak.com>
Subject Re: using distcp for http source files
Date Wed, 21 Jan 2009 22:01:37 GMT
Tsz Wo (Nicholas), Sze <s29752-hadoopuser@...> writes:

> 
> Hi Derek,
> 
> The "http" in "http://core:7274/logs/log.20090121" should be "hftp".  hftp is
> the scheme name of HftpFileSystem which uses http for accessing hdfs.
> 
> Hope this helps.
> 
> Nicholas Sze


I thought hftp was used to talk to servlets that act as a gateway to hdfs,
right?  In my case the sources will be servers that serve up static log files
and run no servlets.  I believe this is the scenario that HADOOP-341 describes:
"Enhance it [distcp] to handle http as the source protocol i.e. support copying
files from arbitrary http-based sources into the dfs."

In any case, if I just use hftp instead of http, I get this error:

bin/hadoop distcp -f hftp://core:7274/logs/log.20090121 /user/dyoung/mylogs

With failures, global counters are inaccurate; consider running with -i
Copy failed: java.io.IOException: Server returned HTTP response code: 400 for URL:
http://core:7274/data/logs/log.20090121?ugi=dyoung,dyoung,adm,dialout,fax,cdrom,cdrom,\
floppy,floppy,tape,audio,audio,dip,dip,video,video,\
plugdev,plugdev,admin,users,scanner,fuse,fuse,lpadmin,\
admin,vboxusers
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1241)
        at org.apache.hadoop.hdfs.HftpFileSystem.open(HftpFileSystem.java:124)
        at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:359)
        at org.apache.hadoop.tools.DistCp.fetchFileList(DistCp.java:581)
        at org.apache.hadoop.tools.DistCp.access$300(DistCp.java:74)
        at org.apache.hadoop.tools.DistCp$Arguments.valueOf(DistCp.java:775)
        at org.apache.hadoop.tools.DistCp.run(DistCp.java:844)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.tools.DistCp.main(DistCp.java:871)
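
Comparing the command with the URL in the error suggests what happens to an hftp path: the scheme becomes http, the path gains a "/data" prefix, and a "ugi" query parameter with the user and group names is appended. A plain web server serving static files has no reason to understand such a request. The sketch below only reproduces that observed mapping. The function name hftp_to_http is hypothetical, and the "/data" prefix and "ugi" parameter are inferred from the error message above, not taken from the HftpFileSystem source.

```python
from urllib.parse import urlsplit, urlunsplit


def hftp_to_http(hftp_url, user, groups):
    """Approximate the HTTP request URL that appears in the error above.

    Assumption: the "/data" prefix and the "ugi" query parameter are
    inferred from the failing URL in the error message; this is not the
    actual HftpFileSystem implementation.
    """
    parts = urlsplit(hftp_url)
    if parts.scheme != "hftp":
        raise ValueError("expected an hftp:// URL, got: " + hftp_url)
    # The user name comes first, followed by the group names, comma-separated.
    ugi = ",".join([user] + list(groups))
    return urlunsplit(("http", parts.netloc, "/data" + parts.path, "ugi=" + ugi, ""))


if __name__ == "__main__":
    print(hftp_to_http("hftp://core:7274/logs/log.20090121", "dyoung", ["dyoung", "adm"]))
```

So the static file server is asked for /data/logs/log.20090121 rather than /logs/log.20090121, which fits hftp expecting an hdfs gateway servlet on the other end rather than an arbitrary http source.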

> 
> ----- Original Message ----
> > From: Derek Young <dyoung@...>
> > To: core-user@...
> > Sent: Wednesday, January 21, 2009 1:23:56 PM
> > Subject: using distcp for http source files
> > 
> > I plan to use hadoop to do some log processing and I'm working on a method to
> > load the files (probably nightly) into hdfs.  My plan is to have a web server on
> > each machine with logs that serves up the log directories.  Then I would give
> > distcp a list of http URLs of the log files and have it copy the files in.
> > 
> > Reading http://issues.apache.org/jira/browse/HADOOP-341 it sounds like this
> > should be supported, but the http URLs are not working for me.  Are http source
> > URLs still supported?
> > 
> > I tried a simple test with an http source URL (using Hadoop 0.19):
> > 
> > hadoop distcp -f http://core:7274/logs/log.20090121 /user/dyoung/mylogs
> > 
> > This fails:
> > 
> > With failures, global counters are inaccurate; consider running with -i
> > Copy failed: java.io.IOException: No FileSystem for scheme: http
> >    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1364)
> >    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
> >    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
> >    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
> >    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
> >    at org.apache.hadoop.tools.DistCp.fetchFileList(DistCp.java:578)
> >    at org.apache.hadoop.tools.DistCp.access$300(DistCp.java:74)
> >    at org.apache.hadoop.tools.DistCp$Arguments.valueOf(DistCp.java:775)
> >    at org.apache.hadoop.tools.DistCp.run(DistCp.java:844)
> >    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> >    at org.apache.hadoop.tools.DistCp.main(DistCp.java:871)
> 
> 
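
The "No FileSystem for scheme: http" error in the quoted message comes from the way a URL scheme is resolved to a FileSystem class: the scheme is looked up under a configuration key of the form fs.<scheme>.impl, and the lookup fails when no entry exists. The sketch below imitates that lookup in Python. The function name filesystem_class_for is hypothetical, and the table is an illustrative subset that should be treated as an assumption about a default configuration; only the HftpFileSystem class name is confirmed by the stack trace in this thread.

```python
from urllib.parse import urlsplit

# Illustrative subset of fs.<scheme>.impl entries (an assumption about a
# default configuration, not a dump of any real one). Note: no "fs.http.impl".
FS_IMPLS = {
    "fs.file.impl": "org.apache.hadoop.fs.LocalFileSystem",
    "fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem",
    "fs.hftp.impl": "org.apache.hadoop.hdfs.HftpFileSystem",
}


def filesystem_class_for(url, conf=FS_IMPLS):
    """Return the class name registered for the URL's scheme, or raise."""
    scheme = urlsplit(url).scheme
    impl = conf.get("fs.%s.impl" % scheme)
    if impl is None:
        # Mirrors the message reported in the quoted stack trace.
        raise IOError("No FileSystem for scheme: " + scheme)
    return impl


if __name__ == "__main__":
    print(filesystem_class_for("hftp://core:7274/logs/log.20090121"))
```

Under this reading, changing "http" to "hftp" only swaps which class handles the URL; it does not make a plain http server a valid source.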
