hadoop-mapreduce-user mailing list archives

From philmccar...@gmail.com
Subject Re: Parallelizing HTTP calls with MapReduce
Date Tue, 09 Mar 2010 22:44:38 GMT
Thanks for the replies; it sounds like there are a couple of
different approaches for me to investigate. All of these requests
will actually be to the same service, which should reduce DNS
overhead, but rate limiting is going to be an issue: their developer
guidelines are pretty clear on that.

I'm looking at something on the order of 10 million calls. There's
no hard performance requirement, since it's a one-off job, but I'd
prefer it not to take days.
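
(Back of the envelope, using Aaron's second-or-two estimate below:
10 million serial requests is about 10^7 seconds, or roughly 115
days, so I'll need well over a hundred requests in flight at once to
finish within a day.)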

Thanks again for the suggestions,

Phil

On 8 Mar 2010, at 18:46, Aaron Kimball <aaron@cloudera.com> wrote:

> I think you should actually use the Java-based MapReduce here.
>
> As has been noted, these will be network-bound calls. And if you're  
> trying to make a lot of them, my experience is that individual calls  
> are slow. 10,000 GET requests could each take a second or two,  
> especially if they involve DNS lookups. But they can be overlapped.
>
> If you're using the old API, consider using the Multithreaded  
> maprunner for this (I think that's  
> org.apache.hadoop.mapred.lib.MultithreadedMapRunner):
>
> JobConf job = new JobConf();
> job.setMapRunnerClass(MultithreadedMapRunner.class);
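> // If I remember right, the per-task thread count is controlled by
> // the mapred.map.multithreadedrunner.threads property (default 10):
> job.setInt("mapred.map.multithreadedrunner.threads", 20);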
>
>  If you're using the new API, there's an analogous
> o.a.h.mapreduce.lib.map.MultithreadedMapper that you should use.
>
> This will allow you to pipeline all those requests and get much  
> faster throughput. (Each map task starts a thread pool of a few  
> threads, which will be given individual map inputs in an overlapped  
> fashion. The same instance of your Mapper class will be used across  
> all threads, so make sure to protect any instance variables.)
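>
> Roughly, the new-API wiring looks like this (UrlFetchMapper is a
> hypothetical mapper that does the actual GETs; I'm writing the
> method names from memory, so check them against your Hadoop
> version):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;
>
> Configuration conf = new Configuration();
> Job job = new Job(conf, "url-fetch");
> job.setMapperClass(MultithreadedMapper.class);
> // Your own mapper does the actual work:
> MultithreadedMapper.setMapperClass(job, UrlFetchMapper.class);
> // Size the thread pool to taste, keeping the caveats above in mind:
> MultithreadedMapper.setNumberOfThreads(job, 20);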
>
> For maximum efficiency, sort all your different URLs by hostname  
> first, so that each split of the input contains all the requests to  
> the same server -- this will allow your DNS caching to be much more  
> efficient (rather than have all your mappers try to DNS lookup the  
> same set of hosts).
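>
> A trivial MapReduce pre-pass does that sort: key each URL on its
> hostname and let the shuffle group them for you. A sketch (the class
> name is made up):
>
> import java.io.IOException;
> import java.net.URL;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Mapper;
>
> public class HostKeyMapper
>     extends Mapper<LongWritable, Text, Text, Text> {
>   @Override
>   protected void map(LongWritable offset, Text line, Context ctx)
>       throws IOException, InterruptedException {
>     String url = line.toString().trim();
>     // Hostname as the key: all URLs for one server end up adjacent.
>     ctx.write(new Text(new URL(url).getHost()), new Text(url));
>   }
> }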
>
> Of course, you want to be careful with something like this. A big  
> Hadoop cluster can easily bring a web server to its knees if you're  
> using too many map tasks in parallel on the same target :) You may  
> want to actually do some rate-limiting of requests to the same  
> node... but how to do that easily is a separate discussion.
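>
> (For what it's worth, the crudest approach is to throttle inside the
> map call itself; the numbers below are invented, and fetch() stands
> in for whatever HTTP client call you use:)
>
> // Cap each thread at ~2 requests/sec. With 20 threads per task and
> // 10 concurrent tasks that's still ~400 req/s at the target, so
> // redo this arithmetic for your own cluster size.
> long start = System.currentTimeMillis();
> fetch(url);
> long elapsed = System.currentTimeMillis() - start;
> if (elapsed < 500) {
>   Thread.sleep(500 - elapsed);  // map() already declares InterruptedException
> }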
>
> - Aaron
>
>
> On Sun, Mar 7, 2010 at 9:46 AM, Erez Katz <erez_katz@yahoo.com> wrote:
> It should be very easy if you just have, say, a list of URLs as
> input. It's not even a map-reduce task, just a map task (with no
> reduce; I don't see where you'd reduce on a key in this scenario).
> Look for map-only tasks in the streaming documentation.
>
> Just pick your favorite scripting language, read URLs from standard
> input line by line, and write each result to standard output.
>
> À la Python:
>
> import sys
> import urllib
>
> for line in sys.stdin:
>     url = line.strip()
>     resp = urllib.urlopen(url)   # fetch the URL
>     print resp.read()            # write the response body to stdout
>     resp.close()
>
>
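> To run it map-only with streaming, set the reducer count to zero;
> something like this (the paths are made up):
>
> hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
>   -D mapred.reduce.tasks=0 \
>   -input /user/phil/urls \
>   -output /user/phil/results \
>   -mapper fetch.py \
>   -file fetch.py
>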
> That's all folks.
>
>
> No real reason to use Java/C++ here; most of the time will be spent
> on network I/O.
>
>
> Cheers,
>
>  Erez Katz
>
>
> --- On Sat, 3/6/10, Phil McCarthy <philmccarthy@gmail.com> wrote:
>
> > From: Phil McCarthy <philmccarthy@gmail.com>
> > Subject: Parallelizing HTTP calls with MapReduce
> > To: mapreduce-user@hadoop.apache.org
> > Date: Saturday, March 6, 2010, 9:29 AM
> > Hi,
> >
> > I'm new to Hadoop, and I'm trying to figure out the best way to
> > use it with EC2 to make a large number of calls to a web API, and
> > then process and store the results. What's the best high-level
> > approach, in terms of using MapReduce to parallelize the process?
> > The calls will be regular HTTP requests, and the URLs follow a
> > known format, so they can be generated easily.
> >
> > This seems like it'd be a pretty common type of task, so apologies
> > if I've missed something obvious in the docs etc.
> >
> > Cheers,
> > Phil McCarthy
> >
