hadoop-common-user mailing list archives

From yz5od2 <woods5242-outdo...@yahoo.com>
Subject Re: architecture help
Date Mon, 16 Nov 2009 15:33:14 GMT
Thanks all for the replies, that makes sense. I think I am allocating  
connection resources per-mapper, instead of per-task.

How do I programmatically allocate a "pool" or shared resource for a
task, that all Mapper instances can have access to?

1) I have 4 nodes, each node has a map capacity of 2 for a total of 8  
tasks running simultaneously. The job I am running is queuing up ~950  
tasks that need to be done.

2) the mysql server I am connecting to is configured to permit 300
connections.
3) When a Mapper instance starts, right now each Mapper instance is
handling its own connections; obviously this is my problem, as each
task must be spinning up dozens/hundreds of mapper instances to process
the task (is that right? or does one Mapper instance process an entire
split?). I need to move this to the "task" level, but this is where I
need some pointers on where to look.
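For reference, since each map task runs in its own JVM (as Jeff Zhang's reply quoted below notes), a pool held in a static field is created at most once per task JVM and is therefore shared by every Mapper instance inside that task. This is only a hedged sketch of the pattern: the class name `TaskConnectionPool`, `POOL_SIZE`, and the use of plain `Object`s in place of `java.sql.Connection` are illustrative, not Hadoop or MySQL API.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: a static field is initialized once per JVM, so in
// Hadoop (one JVM per task) every Mapper instance in a task sees the same
// pool. In real code the Objects would be java.sql.Connection instances.
public class TaskConnectionPool {
    private static final int POOL_SIZE = 4; // connections per task JVM (illustrative)
    private static volatile BlockingQueue<Object> pool;

    // Lazily build the pool the first time any mapper in this JVM asks for
    // it; double-checked locking on the volatile field keeps it to one build.
    public static BlockingQueue<Object> get() {
        if (pool == null) {
            synchronized (TaskConnectionPool.class) {
                if (pool == null) {
                    BlockingQueue<Object> q = new ArrayBlockingQueue<>(POOL_SIZE);
                    for (int i = 0; i < POOL_SIZE; i++) {
                        q.add(new Object()); // real code: DriverManager.getConnection(...)
                    }
                    pool = q;
                }
            }
        }
        return pool;
    }
}
```

A mapper would then borrow with `TaskConnectionPool.get().take()` in `map()` and return the connection in a `finally` block, so the task never holds more than `POOL_SIZE` connections no matter how many records it processes.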

When I submit my job, is there some way to say:
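On the job-submission side, one related knob from the 2009-era mapred API (hedged from memory; verify the property name against your Hadoop release) is JVM reuse, which keeps a task JVM, and any static pool inside it, alive across successive tasks on a node:

```java
// Hedged sketch against the old mapred API; MyJob is a placeholder class.
JobConf conf = new JobConf(MyJob.class);
conf.setInt("mapred.job.reuse.jvm.num.tasks", -1); // -1 = reuse the JVM for unlimited tasks
```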




On Nov 15, 2009, at 7:57 PM, Jeff Zhang wrote:

> Each map task will run in a separate JVM. So you should create a
> connection pool for each task, and all the Mapper instances in one
> task share the same connection pool.
> Another suggestion is that you can use JNDI to manage the
> connections. It can be shared by all the map tasks in your cluster.
> Jeff Zhang
> On Mon, Nov 16, 2009 at 8:52 AM, yz5od2 <woods5242-outdoors@yahoo.com> wrote:
>> Hi,
>> a) I have a Mapper ONLY job: it reads in records, then parses them
>> apart. No reduce phase.
>> b) I would like this mapper job to save each record into a shared
>> mysql database on the network.
>> c) I am running a 4-node cluster and am obviously running out of
>> connections very quickly; that is something I can work on from the db
>> server side.
>> What I am trying to understand is: for each mapper task instance that
>> is processing an input split, does that run in its own classloader? I
>> guess I am trying to figure out how to manage a connection pool on
>> each processing node, so that all Mapper instances would use it to
>> get access to the database. Right now it appears that each node is
>> creating thousands of mapper instances, each with its own connection
>> management, hence this is blowing up quite quickly. I would like the
>> connection management to live separately from the mapper instances on
>> each node.
>> I hope I am explaining what I want to do OK; please let me know if
>> anyone has any thoughts, tips, best practices, features I should look
>> at, etc.
>> thanks
