hadoop-mapreduce-user mailing list archives

From Aaron Kimball <aa...@cloudera.com>
Subject Re: Parallelizing HTTP calls with MapReduce
Date Mon, 08 Mar 2010 18:46:07 GMT
I think you should actually use the Java-based MapReduce here.

As has been noted, these will be network-bound calls. And if you're trying
to make a lot of them, my experience is that individual calls are slow.
10,000 GET requests could each take a second or two, especially if they
involve DNS lookups. But they can be overlapped.
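
To make the overlap concrete, here is a minimal plain-Java sketch (no Hadoop involved). The class OverlappedFetch and the method slowFetch are hypothetical names, and the "request" is just a sleep standing in for network latency; the pool size of 10 is an arbitrary assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class OverlappedFetch {
    // Hypothetical stand-in for an HTTP GET: sleeps to simulate network latency.
    static String slowFetch(String url, long latencyMillis) throws InterruptedException {
        Thread.sleep(latencyMillis);
        return "response for " + url;
    }

    // Runs all fetches on a fixed-size pool; results come back in input order.
    static List<String> fetchAll(List<String> urls, int threads, long latencyMillis)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String url : urls) {
                Callable<String> task = () -> slowFetch(url, latencyMillis);
                pending.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : pending) {
                results.add(f.get()); // blocks until that request finishes
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            urls.add("http://example.com/item/" + i);
        }
        long start = System.nanoTime();
        List<String> results = fetchAll(urls, 10, 100);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(results.size() + " responses in about " + elapsedMillis + " ms");
    }
}
```

With 20 simulated requests of 100 ms each and 10 threads, the run should take roughly a tenth of the time a sequential loop would need, since the waits overlap.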

If you're using the old API, consider using the Multithreaded maprunner for
this (I think that's org.apache.hadoop.mapred.lib.MultithreadedMapRunner):

JobConf job = new JobConf();
job.setMapRunnerClass(MultithreadedMapRunner.class);

If you're using the new API, there's an analogous
o.a.h.mapreduce.lib.map.MultithreadedMapper; set it as the job's mapper class
and point it at your own Mapper with MultithreadedMapper.setMapperClass().

This will allow you to pipeline all those requests and get much faster
throughput. (Each map task starts a thread pool of a few threads, which are
given individual map inputs in an overlapped fashion. With the old-API
MultithreadedMapRunner, the same instance of your Mapper class is used across
all threads, so make sure to protect any instance variables.)
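
As a rough illustration of what "protecting instance variables" can look like when one mapper object is shared by several threads, here is a plain-Java sketch. FetchMapper is a hypothetical mapper-like class, not a Hadoop type; the idea is to keep per-record state in local variables and use an atomic type for anything shared.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class SharedMapperState {
    // Hypothetical mapper-like class; one instance is shared by every worker thread.
    static class FetchMapper {
        // Shared counter: AtomicLong makes concurrent increments safe.
        private final AtomicLong processed = new AtomicLong();

        void map(String url) {
            // Per-record state lives in a local variable, so threads cannot clobber it.
            String trimmed = url.trim();
            if (!trimmed.isEmpty()) {
                processed.incrementAndGet();
            }
        }

        long processedCount() {
            return processed.get();
        }
    }

    // Calls map() from several threads on the same FetchMapper and returns the final count.
    static long runConcurrently(int threads, int callsPerThread) throws InterruptedException {
        FetchMapper mapper = new FetchMapper();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < callsPerThread; i++) {
                    mapper.map("http://example.com/" + i);
                }
            });
        }
        pool.shutdown();
        if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
            throw new IllegalStateException("workers did not finish in time");
        }
        return mapper.processedCount();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runConcurrently(8, 10_000));
    }
}
```

A plain long field incremented with ++ in place of the AtomicLong could lose updates under the same load, which is the kind of bug the warning above is about.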

For maximum efficiency, sort all your different URLs by hostname first, so
that each split of the input contains all the requests to the same server --
this will allow your DNS caching to be much more efficient (rather than have
all your mappers try to DNS lookup the same set of hosts).
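
A small sketch of that pre-sorting step, assuming the input is a list of absolute URLs, one per record. SortByHost and hostOf are hypothetical names. Sorting only makes URLs for one server adjacent in the input; where the split boundaries actually fall still depends on the input format.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class SortByHost {
    // Returns the lower-cased host of a URL, or "" if it cannot be parsed.
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? "" : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return "";
        }
    }

    // Stable sort by host, so URLs for one server end up adjacent in the input.
    static List<String> sortByHost(List<String> urls) {
        List<String> sorted = new ArrayList<>(urls);
        sorted.sort(Comparator.comparing(SortByHost::hostOf));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://b.example.org/2",
                "http://a.example.com/1",
                "http://b.example.org/1",
                "http://a.example.com/2");
        for (String url : sortByHost(urls)) {
            System.out.println(url);
        }
    }
}
```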

Of course, you want to be careful with something like this. A big Hadoop
cluster can easily bring a web server to its knees if you're using too many
map tasks in parallel on the same target :) You may want to actually do some
rate-limiting of requests to the same node... but how to do that easily is a
separate discussion.
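
For what it's worth, the simplest form of this is a per-host minimum interval inside a single JVM. The sketch below is one possible approach under that assumption, not a cluster-wide solution: HostThrottle is a hypothetical class, the 200 ms interval is arbitrary, and it does nothing to coordinate map tasks running on different machines.

```java
import java.util.HashMap;
import java.util.Map;

class HostThrottle {
    private final long minIntervalMillis;
    // Earliest time (ms) at which the next request to each host may be sent.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    HostThrottle(long minIntervalMillis) {
        this.minIntervalMillis = minIntervalMillis;
    }

    // Reserves the next send slot for a host and returns how long the caller should wait (ms).
    synchronized long reserve(String host, long nowMillis) {
        long earliest = nextAllowed.getOrDefault(host, nowMillis);
        long sendAt = Math.max(earliest, nowMillis);
        nextAllowed.put(host, sendAt + minIntervalMillis);
        return sendAt - nowMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        HostThrottle throttle = new HostThrottle(200);
        for (int i = 0; i < 3; i++) {
            long waitMillis = throttle.reserve("api.example.com", System.currentTimeMillis());
            Thread.sleep(waitMillis);
            System.out.println("request " + i + " sent after waiting " + waitMillis + " ms");
        }
    }
}
```

Each thread would call reserve() with the target host just before issuing a request and sleep for the returned number of milliseconds; because the method is synchronized, threads sharing one HostThrottle get distinct slots.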

- Aaron

On Sun, Mar 7, 2010 at 9:46 AM, Erez Katz <erez_katz@yahoo.com> wrote:

> It should be very easy. If you just have say a list of URLS as input...
> It is not even map-reduce task... just map task (with no reduce, i don't
> see where you do a reduce on a key in this scenario).
> Look for map only tasks in the streaming documentation.
> Just pick your favorite scripting language that keeps reading urls from the
> standard input stream line by line and outputs the result to the standard
> output.
> ala python:
> # Python 2: read one URL per line from stdin, write each response body to stdout
> import urllib,sys
> for line in sys.stdin:
>  url = line.strip()
>  x = urllib.urlopen(url)
>  print x.read()
>  x.close()
> That's all folks.
> No real reason to use Java/C++ here, most of the time will be spent on
> network IO.
> Cheers,
>  Erez Katz
> --- On Sat, 3/6/10, Phil McCarthy <philmccarthy@gmail.com> wrote:
> > From: Phil McCarthy <philmccarthy@gmail.com>
> > Subject: Parallelizing HTTP calls with MapReduce
> > To: mapreduce-user@hadoop.apache.org
> > Date: Saturday, March 6, 2010, 9:29 AM
> > Hi,
> >
> > I'm new to Hadoop, and I'm trying to figure out the best way to use it
> > with EC2 to make a large number of calls to a web API, and then process
> > and store the results. I'm completely new to Hadoop, so I'm wondering
> > what's the best high-level approach, in terms of using MapReduce to
> > parallelize the process. The calls will be regular HTTP requests, and
> > the URLs follow a known format, so can be generated easily.
> >
> > This seems like it'd be a pretty common type of task, so apologies if
> > I've missed something obvious in the docs etc.
> >
> > Cheers,
> > Phil McCarthy
> >
