hadoop-mapreduce-user mailing list archives

From Phil McCarthy <philmccar...@gmail.com>
Subject Re: Parallelizing HTTP calls with MapReduce
Date Sun, 07 Mar 2010 12:21:55 GMT
Thanks for the detailed answer; this will be useful to know once
I'm optimizing/tuning.

I'm actually still at the stage of figuring out how to approach
applying the mapreduce pattern to the task, so I'll take your
suggestion of asking again on common-user.

Thanks!

On Sun, Mar 7, 2010 at 8:28 AM, Kay Kay <kaykay.unique@gmail.com> wrote:
> On 03/06/2010 09:29 AM, Phil McCarthy wrote:
>>
>> Hi,
>>
>> I'm new to Hadoop, and I'm trying to figure out the best way to use it
>> with EC2 to make a large number of calls to a web API,
>
> Consider potentially using an HTTP client library / connection that is
> thread-safe.
>>
>>  and then process
>> and store the results. I'm completely new to Hadoop, so I'm wondering
>> what's the best high-level approach, in terms of using MapReduce to
>> parallelize the process. The calls will be regular HTTP requests, and
>> the URLs follow a known format, so can be generated easily.
>>
>
> Profile the mappers / reducers for memory usage (primarily), and watch the
> GC graph for any sharp peaks and the maximum range of memory used, and
> then the CPU.
> While the programming language might be Java, it might be best to think
> of yourselves as writing for an embedded environment: conserve bytes,
> avoid unnecessary new(), go easy on regexes, etc.
> The bandwidth of intermediate results, written to the context by the
> mappers (to hdfs, during the intermediate stage) and transferred to the
> reducers, is a separate matter that is also worth considering.
>
>> This seems like it'd be a pretty common type of task, so apologies if
>> I've missed something obvious in the docs etc.
>>
>
> Good luck! As you might have figured out from the history, the
> common-user@hadoop.apache.org list is busier than this one and, despite
> its name, is still very relevant to HDFS / M-R
> questions.
>
>> Cheers,
>> Phil McCarthy
>>
>
>

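For readers following this thread, here is a minimal sketch of one possible
shape for the task discussed above. It is not taken from the thread. It
assumes Hadoop Streaming (so the mapper can be a script that reads one URL
per line), and the URL template, the function names, and the worker count
are all hypothetical. The URLs would be generated up front into input files,
and each mapper would fetch its share with a small thread pool, emitting
tab-separated "url, body" records.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical URL format; the real API's format is not given in the thread.
URL_TEMPLATE = "http://api.example.com/items/{id}"


def generate_urls(start, stop, template=URL_TEMPLATE):
    """Yield one URL per id in [start, stop), to be written one per line."""
    for item_id in range(start, stop):
        yield template.format(id=item_id)


def fetch(url, timeout=10):
    """Fetch one URL with the standard library and return the body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def _fetch_or_error(fetch_fn, url):
    """Turn a failed request into an error record instead of killing the task."""
    try:
        return fetch_fn(url)
    except Exception as exc:
        return f"ERROR {exc}"


def run_mapper(lines, fetch_fn=fetch, max_workers=8):
    """Map each non-empty input line (a URL) to a 'url<TAB>body' record."""
    urls = [line.strip() for line in lines if line.strip()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input URLs.
        bodies = pool.map(lambda u: _fetch_or_error(fetch_fn, u), urls)
        for url, body in zip(urls, bodies):
            # Tabs and newlines would break the line-oriented record format.
            clean = body.replace("\t", " ").replace("\r", " ").replace("\n", " ")
            yield f"{url}\t{clean}"


# In a streaming mapper script, the driver would be roughly:
#     import sys
#     for record in run_mapper(sys.stdin):
#         print(record)
```

Passing the fetch function in as a parameter keeps the mapper logic testable
without network access, and the thread pool is one way to overlap slow HTTP
calls within a single map task. Whether that helps in practice depends on
the API's rate limits and on the memory profile of the tasks.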