hadoop-common-user mailing list archives

From prasenjit mukherjee <prasen....@gmail.com>
Subject Re: Parallelizing HTTP calls with Hadoop
Date Sun, 07 Mar 2010 13:16:26 GMT
Thanks to Mridul for suggesting the Pig-based approach below, which works fine for me:

-- Each line of the input file is one S3 location to fetch.
input_lines = LOAD 'my_s3_list_file' AS (location_line:chararray);
-- Grouping by the location gives one group per distinct URL; PARALLEL
-- sets the number of reduce tasks, which controls the fetch fan-out.
grp_op = GROUP input_lines BY location_line PARALLEL $NUM_MAPPERS_REQUIRED;
-- The UDF receives the group key (the location string) and does the fetch.
actual_result = FOREACH grp_op GENERATE MY_S3_UDF(group);
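
To run it, save the script as e.g. fetch.pig (the file name is my own) and
pass the parallelism in as a parameter:

  pig -param NUM_MAPPERS_REQUIRED=20 fetch.pig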

I had the same problem to solve (parallelizing S3 fetches).
MY_S3_UDF() is the piece that actually does the HTTP fetching.
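
Since you asked for sample code: the UDF source isn't included here, but a
minimal sketch of such a UDF in Java might look like the following. The
class name, timeouts, and error handling are my own assumptions; it simply
treats the group key as a URL, fetches it over HTTP, and returns the
response body.

// Hypothetical sketch of MY_S3_UDF; not the actual UDF used above.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class MyS3Udf extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        // After GROUP input_lines BY location_line, the "group" argument
        // passed to the UDF is the location string itself.
        String location = input.get(0).toString();
        HttpURLConnection conn =
                (HttpURLConnection) new URL(location).openConnection();
        conn.setConnectTimeout(10000);  // assumed timeouts, tune as needed
        conn.setReadTimeout(30000);
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream body = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                body.write(buf, 0, n);
            }
            return body.toString("UTF-8");
        } finally {
            conn.disconnect();
        }
    }
}

Grouping by location_line means one UDF call per distinct URL, and the
PARALLEL clause spreads those calls across that many reduce tasks.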


On Sun, Mar 7, 2010 at 6:24 PM, Phil McCarthy <philmccarthy@gmail.com> wrote:
> Hi,
> I'm new to Hadoop, and I'm trying to figure out the best way to use it
> to parallelize a large number of calls to a web API, and then process
> and store the results.
> The calls will be regular HTTP requests, and the URLs follow a known
> format, so they can be generated easily. I'd like to understand how to
> apply the MapReduce pattern to this task – should I have one mapper
> generating URLs, and another making the HTTP calls and mapping request
> URLs to their response documents, for example?
> Any links to sample code, examples etc. would be great.
> Cheers,
> Phil
