hadoop-user mailing list archives

From "David Parks" <davidpark...@yahoo.com>
Subject Large input files via HTTP
Date Mon, 22 Oct 2012 08:54:40 GMT
I want to create a MapReduce job that reads many multi-gigabyte input files
from various HTTP sources and processes them nightly.
  
Is there a reasonably flexible way to acquire the files in the Hadoop job
itself? I expect the initial downloads to take many hours, and I'd like to
optimize the number of connections (for example, one host limits me to 5
connections while another allows only 3, so I want to use as many as each
host permits). The set of files to download will also change a little over
time, so the input list should be easy to configure (in a config file or
equivalent).
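
For concreteness, the simplest thing I can picture is a map-only "fetch" job
driven by a plain text file of URLs, one per line, with each mapper streaming
its URL into HDFS. A rough sketch of what I mean is below -- the class names,
the /staging path and the urls.txt argument are just placeholders I made up,
and it doesn't attempt the per-host connection limits (that would have to come
from how many URLs per host run concurrently):

import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class HttpFetchJob {

  /** Each map() call gets one URL from the list file and copies it into HDFS. */
  public static class FetchMapper
      extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws java.io.IOException, InterruptedException {
      URL url = new URL(line.toString().trim());
      // Staging directory and naming scheme are placeholders.
      Path target = new Path("/staging", new Path(url.getPath()).getName());
      FileSystem fs = FileSystem.get(context.getConfiguration());

      InputStream in = url.openStream();
      FSDataOutputStream out = fs.create(target, true);
      try {
        byte[] buf = new byte[64 * 1024];
        int n;
        while ((n = in.read(buf)) != -1) {
          out.write(buf, 0, n);
          context.progress(); // keep multi-hour downloads from timing out
        }
      } finally {
        in.close();
        out.close();
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "nightly-http-fetch");
    job.setJarByClass(HttpFetchJob.class);
    job.setMapperClass(FetchMapper.class);
    job.setNumReduceTasks(0);                              // map-only
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1);          // one URL per map task
    NLineInputFormat.addInputPath(job, new Path(args[0])); // args[0] = urls.txt
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}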
  
 - Is it normal to perform batch downloads like this *before* running the
MapReduce job?
 - Or is it OK to include such steps as part of the job itself?
 - It seems handy to keep the whole process as one neat package in Hadoop if
possible.
 - What class should I implement if I wanted to manage this myself? Would I
just extend TextInputFormat, for example, and do the HTTP processing there?
Or should I be implementing a FileSystem? (A rough sketch of the InputFormat
option follows below.)
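
For that last option, here is roughly what I imagine an HTTP-backed
InputFormat would look like -- one split per source URL, with a RecordReader
that emits lines straight off the HTTP connection. Again, the class names and
the http.input.urls property are made up, and there is no error handling,
retrying or per-host throttling:

import java.io.BufferedReader;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class HttpLineInputFormat extends InputFormat<LongWritable, Text> {

  /** One split per URL listed in a (comma-separated) job property. */
  @Override
  public List<InputSplit> getSplits(JobContext context) {
    String[] urls = context.getConfiguration().getStrings("http.input.urls");
    List<InputSplit> splits = new ArrayList<InputSplit>();
    if (urls != null) {
      for (String u : urls) {
        splits.add(new HttpSplit(u.trim()));
      }
    }
    return splits;
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new HttpLineRecordReader();
  }

  /** A split is just a URL; it has no HDFS block locations. */
  public static class HttpSplit extends InputSplit implements Writable {
    private String url;
    public HttpSplit() {}
    public HttpSplit(String url) { this.url = url; }
    @Override public long getLength() { return 0; }           // unknown up front
    @Override public String[] getLocations() { return new String[0]; }
    @Override public void write(DataOutput out) throws IOException { out.writeUTF(url); }
    @Override public void readFields(DataInput in) throws IOException { url = in.readUTF(); }
  }

  /** Streams lines straight off the HTTP connection, never touching HDFS. */
  public static class HttpLineRecordReader extends RecordReader<LongWritable, Text> {
    private BufferedReader reader;
    private long lineNo = 0;
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
      URL url = new URL(((HttpSplit) split).url);
      reader = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      String line = reader.readLine();
      if (line == null) return false;
      key.set(lineNo++);
      value.set(line);
      return true;
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() { return 0; }        // content length unknown
    @Override public void close() throws IOException { if (reader != null) reader.close(); }
  }
}

With this approach everything comes straight off the socket, so there is no
copy in HDFS to re-read if a task fails partway through a multi-gigabyte
download -- which is part of why I'm unsure whether downloading up front is
the more usual pattern.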

Thanks,
David


