hadoop-user mailing list archives

From Steve Loughran <ste...@hortonworks.com>
Subject Re: Large input files via HTTP
Date Mon, 22 Oct 2012 08:47:56 GMT
Data ingress is often done as an initial MR job.

Here it sounds like you'd need a list of URLs, which you can have a single
mapper run through, emitting each URL keyed by its hostname. That feeds the
reducer:

hostname [url1, url2, ..]
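That mapper-side grouping can be sketched without any Hadoop dependencies; this is plain Java showing how URLs end up bucketed by hostname after the shuffle (the `groupByHost` helper is my own illustration, not a Hadoop API):

```java
import java.net.URI;
import java.util.*;

public class UrlGrouper {
    // Group a list of URLs by hostname, mimicking what the mapper's
    // (hostname, url) emissions look like once the shuffle has grouped
    // them for the reducer.
    static Map<String, List<String>> groupByHost(List<String> urls) {
        Map<String, List<String>> byHost = new HashMap<>();
        for (String url : urls) {
            String host = URI.create(url).getHost();
            byHost.computeIfAbsent(host, h -> new ArrayList<>()).add(url);
        }
        return byHost;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://host-a.example/file1.gz",
            "http://host-b.example/file2.gz",
            "http://host-a.example/file3.gz");
        System.out.println(groupByHost(urls));
    }
}
```

In a real job the mapper would emit `(hostname, url)` pairs and Hadoop's shuffle would do this grouping for you; the sketch just shows the shape of the reducer's input.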

The reducer for each hostname key can do the GET operations for that host,
using whatever per-host connection limits you have. Remember to keep sending
heartbeats to the TaskTracker so it knows your process is alive. Oh, and see
if you can grab any Content-Length and checksum headers to verify the file at
the end of a long download; you don't want to accidentally pull a
half-complete download into your work.
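The end-of-download check can be sketched as a small helper that tallies bytes and computes a digest while streaming, so both can be compared afterwards against the Content-Length header and any checksum header the server sent. Class and method names here are illustrative, not from Hadoop:

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class DownloadVerifier {
    // Byte count and hex MD5 of everything read from the stream.
    static final class Result {
        final long length;
        final String md5Hex;
        Result(long length, String md5Hex) {
            this.length = length;
            this.md5Hex = md5Hex;
        }
    }

    // Drain the response body (writing it out to HDFS is omitted here),
    // tallying bytes and updating an MD5 digest as we go. The caller then
    // compares result.length to the Content-Length header, and result.md5Hex
    // to any checksum header, and discards half-complete downloads.
    static Result consume(InputStream in)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            md5.update(buf, 0, n);
            total += n;
            // In a real reducer you'd also report progress inside this loop
            // so the TaskTracker keeps the task alive during a long download.
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) hex.append(String.format("%02x", b));
        return new Result(total, hex.toString());
    }
}
```

Per-host limits then just mean bounding how many of these downloads you run concurrently for one hostname key, e.g. with a fixed-size thread pool or a semaphore inside that reducer.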

Once the files are in HDFS you can do more work on them, which is where
something like an Oozie workflow can be handy.
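A minimal Oozie workflow for this would chain the download job into the processing job. This is only a skeleton under assumed names; the action names are placeholders, and a real workflow needs the job-tracker, name-node, and configuration elements filled in for each map-reduce action:

```xml
<workflow-app name="nightly-ingest" xmlns="uri:oozie:workflow:0.2">
    <start to="download"/>
    <action name="download">
        <map-reduce>
            <!-- job-tracker, name-node, and the downloader
                 job's configuration go here -->
        </map-reduce>
        <ok to="process"/>
        <error to="fail"/>
    </action>
    <action name="process">
        <map-reduce>
            <!-- configuration for the follow-on processing job -->
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Nightly ingest failed</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

The ok/error transitions are what give you the "one neat package" behaviour: processing only runs once the downloads succeed, and a failed download kills the workflow rather than feeding partial data downstream.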

On 22 October 2012 09:40, David Parks <davidparks21@yahoo.com> wrote:

> I want to create a MapReduce job which reads many multi-gigabyte input
> files
> from various HTTP sources & processes them nightly.
> Is there a reasonably flexible way to do this in the Hadoop job itself? I
> expect the initial downloads to take many hours and I'd hope I can optimize
> the # of connections (example: I'm limited to 5 connections to one host,
> whereas another host has a 3-connection limit, so maximize as much as
> possible).  Also the set of files to download will change a little over
> time
> so the input list should be easily configurable (in a config file or
> equivalent).
> Is it normal to perform batch downloads like this before running the
> mapreduce job? Or is it ok to include such steps in with the job? It seems
> handy to keep the whole process as one neat package in Hadoop if possible.
