hadoop-hdfs-user mailing list archives

From "manish.hadoop.work" <manish.hadoop.w...@gmail.com>
Subject Re: Namenode / Cluster scaling issues in AWS environment
Date Sun, 03 Nov 2013 16:12:23 GMT
Hey Chris,

Thanks for the reply. If you are talking about Hadoop 2.x HDFS federation, it would be
difficult for me to migrate.

Regards,
Manish



-------- Original message --------
From: Chris Mawata <chris.mawata@gmail.com> 
Date: 11/03/2013  6:14 AM  (GMT-08:00) 
To: user@hadoop.apache.org 
Subject: Re: Namenode / Cluster scaling issues in AWS environment 
 
You might also consider federation.
Chris


On 11/3/2013 3:21 AM, Manish Malhotra wrote:
> Hi All,
>
> I'm facing issues scaling a Hadoop cluster. I have the following
> cluster config.
>
>
> 1. AWS Infrastructure.
> 2. 400 DNs
> 3. NN:
>             120 GB memory, 10 Gb network, 32 cores
>             dfs.namenode.handler.count = 128
>             IPC queue size = 128 (default)
> 4. DN: 15.5 GB memory, 1 Gb network, 8 cores
> 5. Hadoop version: 1.0.2
>
>
> Problem: Sometimes the NN becomes unstable and starts showing DNs as down,
> but the DNs are actually running.
> I have seen "Socket timeout exception" from DNs and also an "xcievers
> exception".
> It looks like the NN is busy during that time and suddenly starts losing
> the heartbeats of DNs.
> Once it sees DNs as down, it starts replicating blocks to other nodes,
> but then more nodes become unavailable and it again tries to
> replicate those blocks.
> This is a cycle in which the NN gets trapped and cannot get out.
> The NN looks fine from a memory and CPU usage point of view.
> At most it uses 150% CPU; I believe version 1.0.2 does not use multiple
> cores and uses a single core only.
>
> Potential Reasons:
>
> 1. Small files: we have lots and lots of small files, and we are working
> on it.
> 2. AWS infra is not reliable, so we should increase the
> "datanode.recheck.interval" property to give more time before
> declaring a DN dead.
> 3. Lots of connections to the NN from clients and MR jobs.
> 4. DNs have issues in terms of memory / threads, so they are actually
> not even connecting to the NN.
> But we have not seen an OOM issue yet.
>
> 5. The NN thread dump at the time of the issue shows all the Handler
> threads in a waiting-for-lock state.
>
> If anybody has similar experience with Hadoop on AWS or any other infra and
> can give some input, that would be great.
>
> Regards,
> Manish
>
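
On potential reason 2 in the quoted message: as far as I recall, the Hadoop 1.x
NameNode declares a DN dead after 2 x the recheck interval plus 10 x the
heartbeat interval, and the 1.x key is heartbeat.recheck.interval (milliseconds,
default 5 minutes) alongside dfs.heartbeat.interval (seconds, default 3), not
datanode.recheck.interval. Treat the key names, defaults and formula as
assumptions to verify against FSNamesystem in your release. A minimal Python
sketch of the arithmetic (the function name is made up for illustration):

```python
# Sketch of the dead-node timeout arithmetic, assuming the Hadoop 1.x rule
#   expire = 2 * heartbeat.recheck.interval + 10 * dfs.heartbeat.interval
# Key names and defaults are assumptions; check them against your release.

def dead_node_timeout_ms(recheck_interval_ms=5 * 60 * 1000, heartbeat_interval_s=3):
    """Milliseconds of missed heartbeats before a DN is treated as dead."""
    heartbeat_interval_ms = heartbeat_interval_s * 1000
    return 2 * recheck_interval_ms + 10 * heartbeat_interval_ms


if __name__ == "__main__":
    default_ms = dead_node_timeout_ms()
    doubled_ms = dead_node_timeout_ms(recheck_interval_ms=10 * 60 * 1000)
    print("default timeout (min):", default_ms / 60000.0)
    print("with 10 min recheck interval (min):", doubled_ms / 60000.0)
```

Under these assumptions the default works out to 10.5 minutes, so heartbeats
have to be missed for quite a while before re-replication starts. Raising the
interval buys more slack on a flaky network, at the cost of slower detection of
DNs that are really dead.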

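
On point 5 in the quoted message: one way to quantify "all the Handler threads
are waiting for a lock" is to count the handler threads per Java thread state
in a jstack dump taken during the incident. The sketch below assumes the
handler threads are named "IPC Server handler N on PORT" and that the dump uses
the usual "java.lang.Thread.State: ..." lines. The sample dump is made up for
illustration and is not from the cluster described here.

```python
import re
from collections import Counter

# A thread header line in a jstack dump starts with the quoted thread name.
HEADER = re.compile(r'^"(?P<name>[^"]+)"')
# The state line follows the header, e.g. "java.lang.Thread.State: BLOCKED (...)".
STATE = re.compile(r'java\.lang\.Thread\.State: (?P<state>[A-Z_]+)')


def handler_states(dump_text, name_prefix="IPC Server handler"):
    """Count thread states for threads whose name starts with name_prefix."""
    counts = Counter()
    current = None  # name of the thread whose state line we expect next
    for line in dump_text.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group("name")
            continue
        state = STATE.search(line)
        if state and current is not None and current.startswith(name_prefix):
            counts[state.group("state")] += 1
            current = None  # count each thread at most once
    return counts


# Hypothetical, hand-written sample; real dumps carry full stack traces.
SAMPLE = '''
"IPC Server handler 0 on 8020" daemon prio=10 tid=0x1 nid=0x1 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
"IPC Server handler 1 on 8020" daemon prio=10 tid=0x2 nid=0x2 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
"IPC Server handler 2 on 8020" daemon prio=10 tid=0x3 nid=0x3 runnable
   java.lang.Thread.State: RUNNABLE
"main" prio=10 tid=0x4 nid=0x4 waiting on condition
   java.lang.Thread.State: WAITING (parking)
'''

if __name__ == "__main__":
    for state, n in sorted(handler_states(SAMPLE).items()):
        print(state, n)
```

If nearly all handlers show up as BLOCKED, the single RUNNABLE thread holding
the lock is usually the place to look. If they are WAITING instead, the
handlers may simply be idle and the bottleneck is elsewhere.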