hadoop-mapreduce-user mailing list archives

From Erez Katz <erez_k...@yahoo.com>
Subject Re: Parallelizing HTTP calls with MapReduce
Date Sun, 07 Mar 2010 17:46:13 GMT
It should be very easy, if you just have, say, a list of URLs as input.
It is not even a map-reduce task, just a map task (with no reduce; I don't see where you would
reduce on a key in this scenario).
Look for map-only jobs in the Hadoop Streaming documentation.
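
As a rough sketch of what launching such a map-only streaming job could look like, here is a small Python helper that only assembles the command line. The jar location, the HDFS paths and the script name fetch_urls.py are all hypothetical placeholders; -numReduceTasks 0 is the streaming option that asks for no reduce phase, and -file ships the script to the task nodes (newer Hadoop releases prefer the generic -files option).

```python
import shlex


def build_streaming_command(streaming_jar, input_path, output_path, mapper_script):
    """Assemble the argv for a map-only Hadoop Streaming job (nothing is executed here)."""
    return [
        "hadoop", "jar", streaming_jar,
        "-numReduceTasks", "0",      # map-only: skip the reduce phase entirely
        "-input", input_path,        # text file(s) with one URL per line
        "-output", output_path,      # output directory, must not exist yet
        "-mapper", mapper_script,    # the script run for every input split
        "-file", mapper_script,      # ship the script along with the job
    ]


if __name__ == "__main__":
    # All four values below are made-up placeholders; adjust them to your cluster.
    command = build_streaming_command(
        "/path/to/hadoop-streaming.jar",
        "urls.txt",
        "fetched",
        "fetch_urls.py",
    )
    print(shlex.join(command))
```

The printed line can be pasted into a shell, or the list can be handed to subprocess.run once the placeholders are replaced. The mapper script needs a shebang line and the executable bit for this form of -mapper to work.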

Just pick your favorite scripting language and write a script that keeps reading URLs from the
standard input stream line by line and writes each result to the standard output.

For example, in Python:

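import sys
import urllib.request

# Read one URL per line from stdin and write each response body to stdout.
for line in sys.stdin:
  url = line.strip()
  if not url:
    continue  # skip blank lines
  with urllib.request.urlopen(url) as response:
    print(response.read().decode("utf-8", errors="replace"))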

That's all folks.

No real reason to use Java/C++ here; most of the time will be spent on network I/O.
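
If you want something a bit more defensive than the few lines above, here is a sketch of a Python 3 mapper. It makes two assumptions of my own: that a failed request should be logged to stderr and skipped rather than kill the task, and that each result should fit on a single output line (URL, a tab, then the body with newlines escaped), since streaming treats every line as one record. The function names and the 30 second timeout are just illustrative.

```python
import sys
import urllib.request


def fetch(url, timeout=30):
    """Return the response body as text; raises on network or HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def to_record(url, body):
    """Format one output line: the URL, a tab, then the body with control characters escaped."""
    escaped = (
        body.replace("\\", "\\\\")
        .replace("\t", "\\t")
        .replace("\r", "\\r")
        .replace("\n", "\\n")
    )
    return url + "\t" + escaped


def main(lines=None, out=None, err=None, fetcher=fetch):
    """Read URLs line by line, fetch each one, and write one record per URL."""
    lines = sys.stdin if lines is None else lines
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    for line in lines:
        url = line.strip()
        if not url:
            continue  # ignore blank input lines
        try:
            body = fetcher(url)
        except (OSError, ValueError) as exc:
            # URLError and HTTPError are OSError subclasses; ValueError covers
            # malformed URLs. Log the failure and move on to the next URL.
            err.write("failed\t%s\t%s\n" % (url, exc))
            continue
        out.write(to_record(url, body) + "\n")


if __name__ == "__main__":
    main()
```

You can try it locally without Hadoop by piping a file of URLs into the script. Whether skipping failed URLs is the right behaviour depends on your job; you may prefer to retry or to emit the failures as records instead.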

  Erez Katz

--- On Sat, 3/6/10, Phil McCarthy <philmccarthy@gmail.com> wrote:

> From: Phil McCarthy <philmccarthy@gmail.com>
> Subject: Parallelizing HTTP calls with MapReduce
> To: mapreduce-user@hadoop.apache.org
> Date: Saturday, March 6, 2010, 9:29 AM
> Hi,
> I'm new to Hadoop, and I'm trying to figure out the best way to use it
> with EC2 to make a large number of calls to a web API, and then process
> and store the results. I'm completely new to Hadoop, so I'm wondering
> what's the best high-level approach, in terms of using MapReduce to
> parallelize the process. The calls will be regular HTTP requests, and
> the URLs follow a known format, so they can be generated easily.
> This seems like it'd be a pretty common type of task, so apologies if
> I've missed something obvious in the docs etc.
> Cheers,
> Phil McCarthy

