hadoop-common-user mailing list archives

From Zak Stone <zst...@gmail.com>
Subject Re: Parallelizing HTTP calls with Hadoop
Date Sun, 07 Mar 2010 14:30:39 GMT
Hi Phil,

If you treat each HTTP request as a separate Hadoop task and the
individual HTTP responses are small, you may find that the latency of
the web service leaves your Hadoop processes idle most of the time.
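(To put rough numbers on it: if each response takes around 500 ms and
each task issues one request at a time, a map slot completes only about
two requests per second while its CPU sits idle. Those figures are just
illustrative, but the ratio is the problem.)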

To avoid this problem, you can let each mapper make many HTTP requests
in parallel, either using asynchronous programming or using threads.
For example, each mapper could load batches of URLs from Hadoop into
an internal work queue, and 100 threads per mapper could pull URLs off
the work queue and push the HTTP responses onto another in-memory
output queue. A separate thread could then steadily take items from
the output queue and stream them back to Hadoop as key-value pairs.
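
To make the shape concrete, here is a rough sketch of such a mapper
using the org.apache.hadoop.mapreduce API. The class name, the queue
capacity, and the 60-second poll timeout are placeholder assumptions
(only the 100-thread figure comes from the description above), and a
real job would also want retries, error handling, and some politeness
toward the web service:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HttpFetchMapper extends Mapper<LongWritable, Text, Text, Text> {

  private static final int THREADS = 100;  // fetcher threads per mapper
  private final List<String> urls = new ArrayList<String>();

  // Each input record is assumed to be a single URL; just buffer them.
  protected void map(LongWritable key, Text value, Context context) {
    urls.add(value.toString());
  }

  // Do the actual fetching once the whole split has been read.
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(THREADS);
    // In-memory output queue: fetcher threads push (url, body) pairs
    // onto it, and this thread alone drains it back to Hadoop.
    final BlockingQueue<String[]> results =
        new LinkedBlockingQueue<String[]>(1000);

    for (final String url : urls) {
      pool.submit(new Runnable() {
        public void run() {
          try {
            results.put(new String[] { url, fetch(url) });
          } catch (Exception e) {
            // A real job should retry and/or record the failure.
          }
        }
      });
    }
    pool.shutdown();

    int written = 0;
    while (written < urls.size()) {
      // Poll with a timeout so a hung fetch cannot block us forever;
      // failed fetches are simply dropped in this sketch.
      String[] pair = results.poll(60, TimeUnit.SECONDS);
      if (pair != null) {
        context.write(new Text(pair[0]), new Text(pair[1]));
        written++;
      } else if (pool.isTerminated()) {
        break;
      }
    }
  }

  private String fetch(String url) throws IOException {
    BufferedReader in = new BufferedReader(
        new InputStreamReader(new URL(url).openStream()));
    try {
      StringBuilder body = new StringBuilder();
      String line;
      while ((line = in.readLine()) != null) {
        body.append(line).append('\n');
      }
      return body.toString();
    } finally {
      in.close();
    }
  }
}

Note that only the cleanup thread ever calls context.write(), since the
MapReduce context is not generally safe to write to from multiple
threads. You could also look at MultithreadedMapper, which ships with
Hadoop and runs several map() calls concurrently, though the explicit
queues above give you more control over batching and backpressure.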

Hope that helps,
Zak


On Sun, Mar 7, 2010 at 7:54 AM, Phil McCarthy <philmccarthy@gmail.com> wrote:
> Hi,
>
> I'm new to Hadoop, and I'm trying to figure out the best way to use it
> to parallelize a large number of calls to a web API, and then process
> and store the results.
>
> The calls will be regular HTTP requests, and the URLs follow a known
> format, so they can be generated easily. I'd like to understand how to
> apply the MapReduce pattern to this task – should I have one mapper
> generating URLs, and another making the HTTP calls and mapping request
> URLs to their response documents, for example?
>
> Any links to sample code, examples etc. would be great.
>
> Cheers,
> Phil
>
