Mailing-List: contact common-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: common-user@hadoop.apache.org
Message-Id: <5657441A-65A5-47A3-B982-4EF89BBB692F@yahoo.com>
From: yz5od2 <woods5242-outdoors@yahoo.com>
To: common-user@hadoop.apache.org
In-Reply-To: <8211a1320911151857q647dc64cn6e1120bc64993841@mail.gmail.com>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
Content-Transfer-Encoding: 7bit
Mime-Version: 1.0 (Apple Message framework v936)
Subject: Re: architecture help
Date: Mon, 16 Nov 2009 08:33:14 -0700
References: <E8A92A35-364B-4DC4-A232-E011EE20F22B@yahoo.com>
 <8211a1320911151857q647dc64cn6e1120bc64993841@mail.gmail.com>

Thanks all for the replies; that makes sense. I think I am allocating
connection resources per mapper instead of per task.

How do I programmatically allocate a "pool" or other shared resource for
a task, so that all Mapper instances in that task can access it?

1) I have 4 nodes, each node has a map capacity of 2 for a total of 8  
tasks running simultaneously. The job I am running is queuing up ~950  
tasks that need to be done.

2) The MySQL server I am connecting to is configured to permit 300
connections.

3) Right now each Mapper instance opens and manages its own connections.
That is obviously my problem if each task spins up dozens or hundreds of
Mapper instances to process its split (is that right, or does one Mapper
instance process an entire split?). I need to move connection handling
up to the task level, but this is where I need some pointers on where to
look.

When I submit my job is there some way to say:

jobConf.setTaskHandlingClass(SomeClassThatCreatesThePoolThatTaskMapperInstancesAccess.class)

??
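
To make the question concrete, here is a rough sketch of the shape I am
after, in plain Java with no Hadoop dependency. Every name in it
(FakeConnection, RecordSavingTask, SharedTaskResourceDemo) is made up for
illustration, and FakeConnection only counts how many connections get
opened instead of talking to MySQL. The configure() and close() methods
are meant to mirror the once-per-task hooks of the same names on the
org.apache.hadoop.mapred Mapper interface; that mapping is my assumption
about where the setup and teardown would go, not something I have working.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a JDBC connection; only counts how many are opened. */
final class FakeConnection {
    static final AtomicInteger OPENED = new AtomicInteger();
    private boolean closed;

    FakeConnection() {
        OPENED.incrementAndGet();
    }

    void insert(String record) {
        if (closed) {
            throw new IllegalStateException("connection already closed");
        }
        // A real implementation would execute an INSERT here.
    }

    void close() {
        closed = true;
    }
}

/** Models one map task: set up once, called once per record, torn down once. */
final class RecordSavingTask {
    private FakeConnection conn;
    private int saved;

    // Analogous to configure(JobConf): runs once, before any record is seen.
    void configure() {
        conn = new FakeConnection();
    }

    // Analogous to map(...): runs once per record and reuses the connection.
    void map(String record) {
        conn.insert(record);
        saved++;
    }

    // Analogous to close(): runs once, after the last record.
    void close() {
        conn.close();
    }

    int saved() {
        return saved;
    }
}

final class SharedTaskResourceDemo {
    /** Runs the given number of tasks one after another; returns connections opened. */
    static int connectionsOpened(int tasks, int recordsPerTask) {
        int before = FakeConnection.OPENED.get();
        for (int t = 0; t < tasks; t++) {
            RecordSavingTask task = new RecordSavingTask();
            task.configure();
            try {
                for (int r = 0; r < recordsPerTask; r++) {
                    task.map("record-" + r);
                }
            } finally {
                task.close();
            }
        }
        return FakeConnection.OPENED.get() - before;
    }

    public static void main(String[] args) {
        // 950 tasks as in my job; 1000 records per task is an arbitrary figure.
        System.out.println(connectionsOpened(950, 1000)); // prints 950
    }
}
```

If connections really are opened once per task like this, the number open
at any moment would be bounded by the number of tasks running at once (8
on my cluster), which is well under the 300 the server permits.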


On Nov 15, 2009, at 7:57 PM, Jeff Zhang wrote:

> Each map task will run in a separate JVM, so you should create a
> connection pool for each task. All the mapper instances in one task
> then share the same connection pool.
>
> Another suggestion is that you can use JNDI to manage the connection.
> It can be shared by all the map tasks in your cluster.
>
>
> Jeff Zhang
>
>
>
>
> On Mon, Nov 16, 2009 at 8:52 AM, yz5od2 <woods5242-outdoors@yahoo.com> wrote:
>
>> Hi,
>>
>> a) I have a Mapper-only job. The job reads in records, then parses them
>> apart. There is no reduce phase.
>>
>> b) I would like this mapper job to save each record into a shared MySQL
>> database on the network.
>>
>> c) I am running a 4-node cluster and obviously running out of
>> connections very quickly. That is something I can work on from the db
>> server side.
>>
>> What I am trying to understand is this: does each mapper task instance
>> that is processing an input split run in its own classloader? I am
>> trying to figure out how to manage a connection pool on each processing
>> node, so that all mapper instances would use it to get access to the
>> database. Right now it appears that each node is creating thousands of
>> mapper instances, each with its own connection management, so this is
>> blowing up quite quickly. I would like the connection management to
>> live separately from the mapper instances, per node.
>>
>> I hope I am explaining what I want to do OK. Please let me know if
>> anyone has any thoughts, tips, best practices, features I should look
>> at, etc.
>>
>> thanks
>>
>>
>>
>>
>>