hadoop-common-user mailing list archives

From Mark Kerzner <markkerz...@gmail.com>
Subject Re: Parallelizing HTTP calls with Hadoop
Date Sun, 07 Mar 2010 14:34:45 GMT
Phil,

What you are describing is close to what Nutch already does. Take a look
at it: this kind of code is non-trivial, and reusing it can save you a lot
of work and debugging.

Mark

On Sun, Mar 7, 2010 at 8:30 AM, Zak Stone <zstone@gmail.com> wrote:

> Hi Phil,
>
> If you treat each HTTP request as a Hadoop task and the individual
> HTTP responses are small, you may find that the latency of the web
> service leaves most of your Hadoop processes idle most of the time.
>
> To avoid this problem, you can let each mapper make many HTTP requests
> in parallel, either using asynchronous programming or using threads.
> For example, each mapper could load batches of URLs from Hadoop into
> an internal work queue, and 100 threads per mapper could pull URLs off
> the work queue and push the HTTP responses onto another in-memory
> output queue. A separate thread could then steadily take items from
> the output queue and stream them back to Hadoop as key-value pairs.
>
> Hope that helps,
> Zak
>
>
> On Sun, Mar 7, 2010 at 7:54 AM, Phil McCarthy <philmccarthy@gmail.com>
> wrote:
> > Hi,
> >
> > I'm new to Hadoop, and I'm trying to figure out the best way to use it
> > to parallelize a large number of calls to a web API, and then process
> > and store the results.
> >
> > The calls will be regular HTTP requests, and the URLs follow a known
> > format, so they can be generated easily. I'd like to understand how to
> > apply the MapReduce pattern to this task – should I have one mapper
> > generating URLs, and another making the HTTP calls and mapping request
> > URLs to their response documents, for example?
> >
> > Any links to sample code, examples etc. would be great.
> >
> > Cheers,
> > Phil
> >
>
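
The pattern Zak describes (an in-memory work queue, a pool of worker threads, and one writer thread draining an output queue) could be sketched roughly as below. This is only an illustration, not code from the thread: `fetch_all`, `fetch`, `emit`, and `num_workers` are hypothetical names, and the HTTP call is passed in as a plain function so the sketch runs without a network.

```python
import queue
import threading

_STOP = object()  # sentinel that tells a thread its queue is finished


def fetch_all(urls, fetch, emit, num_workers=100):
    """Call fetch(url) on worker threads and hand (url, response) to emit.

    fetch and emit are caller-supplied (hypothetical) functions: fetch
    performs the HTTP request, emit writes one key-value pair.
    """
    work = queue.Queue()     # URLs waiting to be fetched
    results = queue.Queue()  # (url, response) pairs waiting to be written

    def worker():
        while True:
            url = work.get()
            if url is _STOP:
                break
            try:
                results.put((url, fetch(url)))
            except Exception as exc:
                # Record the failure instead of letting it kill the worker.
                results.put((url, "ERROR: %s" % exc))

    def writer():
        # A single writer keeps the output from being interleaved.
        while True:
            item = results.get()
            if item is _STOP:
                break
            emit(*item)

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    writer_thread = threading.Thread(target=writer)
    for t in workers:
        t.start()
    writer_thread.start()

    for url in urls:
        work.put(url)
    for _ in workers:
        work.put(_STOP)  # one sentinel per worker
    for t in workers:
        t.join()
    results.put(_STOP)   # all workers are done, so the writer can stop too
    writer_thread.join()
```

In a Hadoop Streaming mapper, one plausible wiring would be to pass the lines of `sys.stdin` as `urls`, a function wrapping `urllib.request.urlopen(url).read()` as `fetch`, and an `emit` that prints the URL and the response separated by a tab. Responses containing tabs or newlines would need to be encoded first, and results arrive in completion order rather than input order.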
