hadoop-common-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Hang when add/remove a datanode into/from a 2 datanode cluster
Date Wed, 31 Jul 2013 16:56:57 GMT
Step (a) points to both your problem and its solution: you have files
being created with repl=3 on a 2-DN cluster, which prevents the
decommission from completing. This is not a bug.
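To see why the decommission stalls, consider the invariant the NameNode enforces: a node is only marked Decommissioned once every block it holds has enough replicas on the *remaining* nodes, and HDFS never places two replicas of one block on the same DataNode. A minimal sketch of that check (plain Python illustration, not HDFS code):

```python
def can_finish_decommission(live_datanodes: int, block_replication: int) -> bool:
    """A block can be fully re-replicated off the decommissioning node
    only if each remaining live node can hold one distinct replica."""
    remaining = live_datanodes - 1  # the node being decommissioned drops out
    return remaining >= block_replication

# 3 DNs but files written with repl=3: only 2 nodes remain -> hangs
print(can_finish_decommission(3, 3))  # False
# 3 DNs and repl=2: 2 remaining nodes suffice -> completes
print(can_finish_decommission(3, 2))  # True
```

This is why `hadoop dfs -setrep -R 2 /` unblocks the decommission in the scenario above: it lowers the required replica count to what the remaining nodes can satisfy.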

On Wed, Jul 31, 2013 at 12:09 PM, sam liu <samliuhadoop@gmail.com> wrote:
> I opened a jira for tracking this issue:
> https://issues.apache.org/jira/browse/HDFS-5046
>
>
> 2013/7/2 sam liu <samliuhadoop@gmail.com>
>>
>> Yes, the default replication factor is 3. However, in my case it's
>> strange: while the decommission hangs, I found that some blocks' expected
>> replica count is 3, even though the 'dfs.replication' value in hdfs-site.xml
>> on every cluster node has been 2 since the cluster was set up. Below are my steps:
>>
>> 1. Install a Hadoop 1.1.1 cluster with 2 datanodes, dn1 and dn2, and, in
>> hdfs-site.xml, set 'dfs.replication' to 2
>> 2. Add node dn3 into the cluster as a new datanode, without changing the
>> 'dfs.replication' value in hdfs-site.xml (it stays 2)
>> note: step 2 passed
>> 3. Decommission dn3 from the cluster
>> Expected result: dn3 could be decommissioned successfully
>> Actual result:
>> a). The decommission progress hangs and the status stays 'Waiting DataNode
>> status: Decommissioned'. But if I execute 'hadoop dfs -setrep -R 2 /', the
>> decommission continues and eventually completes.
>> b). However, if the initial cluster includes >= 3 datanodes, this issue
>> is not encountered when adding/removing another datanode. For example, if I
>> set up a cluster with 3 datanodes, I can successfully add a 4th datanode
>> to it, and then also successfully remove that 4th datanode from the
>> cluster.
>>
>> I suspect it's a bug and plan to open a jira against Hadoop HDFS for this.
>> Any comments?
>>
>> Thanks!
>>
>>
>> 2013/6/21 Harsh J <harsh@cloudera.com>
>>>
>>> dfs.replication is a per-file parameter. If you have a client that
>>> does not use the supplied configs, then its default replication is 3,
>>> and all files it creates (as part of the app or via a job config)
>>> will be written with replication factor 3.
>>>
>>> You can do an -lsr to list all files and filter out the ones that were
>>> created with a factor of 3 (versus the expected config of 2).
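The `-lsr` listing prints the replication factor as the second column for files (a `-` for directories). A rough sketch, assuming that listing format, of picking out the files created with factor 3:

```python
# Feed this the captured output of `hadoop fs -lsr /` (Hadoop 1.x format:
# permissions, replication, owner, group, size, date, time, path).
def files_with_replication(lsr_output: str, factor: str = "3"):
    hits = []
    for line in lsr_output.splitlines():
        fields = line.split()
        # Directories show '-' in the replication column and are skipped.
        if len(fields) >= 8 and fields[1] == factor:
            hits.append(fields[-1])  # last field is the path
    return hits

sample = """\
drwxr-xr-x   - hdfs supergroup          0 2013-06-21 12:00 /user
-rw-r--r--   3 hdfs supergroup       1024 2013-06-21 12:01 /user/a.txt
-rw-r--r--   2 hdfs supergroup       2048 2013-06-21 12:02 /user/b.txt
"""
print(files_with_replication(sample))  # ['/user/a.txt']
```

The offending paths can then be fixed with `hadoop dfs -setrep 2 <path>` (or `-setrep -R 2 /` for everything at once, as mentioned earlier in the thread).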
>>>
>>> On Fri, Jun 21, 2013 at 3:13 PM, sam liu <samliuhadoop@gmail.com> wrote:
>>> > Hi George,
>>> >
>>> > Actually, in my hdfs-site.xml I always set 'dfs.replication' to 2, but I
>>> > still encounter this issue.
>>> >
>>> > Thanks!
>>> >
>>> >
>>> > 2013/6/21 George Kousiouris <gkousiou@mail.ntua.gr>
>>> >>
>>> >>
>>> >> Hi,
>>> >>
>>> >> I think I have faced this before. The problem is that you have the rep
>>> >> factor = 3, so the decommission seems to hang because it needs 3 nodes
>>> >> to achieve that factor (replicas are never placed on the same node). If
>>> >> you set the replication factor to 2, I think you will not have this
>>> >> issue. So in general you must make sure that the rep factor is <= the
>>> >> number of available datanodes.
>>> >>
>>> >> BR,
>>> >> George
>>> >>
>>> >>
>>> >> On 6/21/2013 12:29 PM, sam liu wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> I encountered an issue which hangs the decommission operation. The
>>> >> steps:
>>> >> 1. Install a Hadoop 1.1.1 cluster with 2 datanodes, dn1 and dn2, and,
>>> >> in hdfs-site.xml, set 'dfs.replication' to 2
>>> >> 2. Add node dn3 into the cluster as a new datanode, without changing
>>> >> the 'dfs.replication' value in hdfs-site.xml (it stays 2)
>>> >> note: step 2 passed
>>> >> 3. Decommission dn3 from the cluster
>>> >>
>>> >> Expected result: dn3 could be decommissioned successfully
>>> >>
>>> >> Actual result: the decommission progress hangs and the status stays
>>> >> 'Waiting DataNode status: Decommissioned'
>>> >>
>>> >> However, if the initial cluster includes >= 3 datanodes, this issue
>>> >> is not encountered when adding/removing another datanode.
>>> >>
>>> >> Also, after step 2, I noticed that some blocks' expected replica count
>>> >> is 3, even though the 'dfs.replication' value in hdfs-site.xml is
>>> >> always 2!
>>> >>
>>> >> Could anyone please help triage this?
>>> >>
>>> >> Thanks in advance!
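For reference, the cluster-side setting the thread keeps referring to looks like this in hdfs-site.xml. Note that it is only a *default* that clients apply at file-creation time; a client that runs without loading this config silently falls back to the built-in default of 3:

```xml
<!-- hdfs-site.xml: default replication applied by clients when creating
     files. A client that does not load this config uses 3. -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```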
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> ---------------------------
>>> >>
>>> >> George Kousiouris, PhD
>>> >> Electrical and Computer Engineer
>>> >> Division of Communications,
>>> >> Electronics and Information Engineering
>>> >> School of Electrical and Computer Engineering
>>> >> Tel: +30 210 772 2546
>>> >> Mobile: +30 6939354121
>>> >> Fax: +30 210 772 2569
>>> >> Email: gkousiou@mail.ntua.gr
>>> >> Site: http://users.ntua.gr/gkousiou/
>>> >>
>>> >> National Technical University of Athens
>>> >> 9 Heroon Polytechniou str., 157 73 Zografou, Athens, Greece
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Harsh J
>>
>>
>



-- 
Harsh J
