hadoop-mapreduce-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Hang when add/remove a datanode into/from a 2 datanode cluster
Date Thu, 01 Aug 2013 03:11:41 GMT
As I said before, it is a per-file property, and the config can be
bypassed by clients that do not read the configs, apply a manual API
override, and so on.

If you want to enforce a hard maximum and catch such clients,
try setting dfs.replication.max to 2 at your NameNode.
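
[For reference, a minimal hdfs-site.xml fragment for the cap suggested above might look like the sketch below. The property name dfs.replication.max is taken from the message; the surrounding layout is the standard Hadoop configuration format.]

```xml
<!-- hdfs-site.xml on the NameNode (sketch, assuming the standard
     Hadoop 1.x configuration format): dfs.replication.max caps the
     replication factor any client may request for a file. -->
<property>
  <name>dfs.replication.max</name>
  <value>2</value>
</property>
```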

On Thu, Aug 1, 2013 at 8:07 AM, sam liu <samliuhadoop@gmail.com> wrote:
> But please note that the value of 'dfs.replication' across the cluster is
> always 2, even when the datanode count is 3. And I am pretty sure I did not
> manually create any files with rep=3. So why were some HDFS files
> created with repl=3 rather than repl=2?
>
>
> 2013/8/1 Harsh J <harsh@cloudera.com>
>>
>> Step (a) points to both your problem and its solution. You have files
>> being created with repl=3 on a 2-DN cluster, which will prevent
>> decommission. This is not a bug.
>>
>> On Wed, Jul 31, 2013 at 12:09 PM, sam liu <samliuhadoop@gmail.com> wrote:
>> > I opened a jira for tracking this issue:
>> > https://issues.apache.org/jira/browse/HDFS-5046
>> >
>> >
>> > 2013/7/2 sam liu <samliuhadoop@gmail.com>
>> >>
>> >> Yes, the default replication factor is 3. However, in my case it's
>> >> strange: while the decommission hangs, I found that some blocks'
>> >> expected replica count is 3, but the 'dfs.replication' value in
>> >> hdfs-site.xml of every cluster node has been 2 since the cluster was
>> >> first set up. Below are my steps:
>> >>
>> >> 1. Install a Hadoop 1.1.1 cluster with 2 datanodes, dn1 and dn2, and
>> >> set 'dfs.replication' to 2 in hdfs-site.xml
>> >> 2. Add node dn3 into the cluster as a new datanode, without changing
>> >> the 'dfs.replication' value in hdfs-site.xml (it stays at 2)
>> >> note: step 2 passed
>> >> 3. Decommission dn3 from the cluster
>> >> Expected result: dn3 is decommissioned successfully
>> >> Actual result:
>> >> a). The decommission progress hangs and the status is always 'Waiting
>> >> DataNode status: Decommissioned'. But if I execute 'hadoop dfs -setrep
>> >> -R 2 /', the decommission continues and eventually completes.
>> >> b). However, if the initial cluster includes >= 3 datanodes, this issue
>> >> is not encountered when adding/removing another datanode. For example,
>> >> if I set up a cluster with 3 datanodes, I can successfully add a 4th
>> >> datanode to it, and then also successfully remove that 4th datanode
>> >> from the cluster.
>> >>
>> >> I suspect it's a bug and plan to open a JIRA against Hadoop HDFS for
>> >> this. Any comments?
>> >>
>> >> Thanks!
>> >>
>> >>
>> >> 2013/6/21 Harsh J <harsh@cloudera.com>
>> >>>
>> >>> dfs.replication is a per-file parameter. If you have a client that
>> >>> does not use the supplied configs, then its default replication is 3,
>> >>> and all files it creates (as part of the app or via a job config)
>> >>> will have replication factor 3.
>> >>>
>> >>> You can do an -lsr to find all files and filter which ones have been
>> >>> created with a factor of 3 (versus the expected config of 2).
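
[As an illustrative sketch of the -lsr filtering described above: `hadoop fs -lsr` prints the replication factor in the second column of each file line, so the output can be scanned for files whose factor differs from the expected 2. The parser below runs on sample output and assumes the Hadoop 1.x listing format; the paths and values are hypothetical.]

```python
# Sketch: filter `hadoop fs -lsr /` output for files whose replication
# factor differs from the expected value. Assumes the Hadoop 1.x listing
# format: permissions, replication, owner, group, size, date, time, path.
# The sample listing below is hypothetical.

def files_with_unexpected_repl(listing, expected=2):
    """Return (replication, path) pairs for files not at `expected`."""
    flagged = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0].startswith("d"):
            continue  # skip blank lines and directories (repl shown as "-")
        repl = int(fields[1])
        if repl != expected:
            flagged.append((repl, fields[-1]))
    return flagged

sample = """\
drwxr-xr-x   - hdfs supergroup          0 2013-06-21 10:00 /user
-rw-r--r--   2 hdfs supergroup       1024 2013-06-21 10:01 /user/a.txt
-rw-r--r--   3 hdfs supergroup       2048 2013-06-21 10:02 /user/b.txt
"""

print(files_with_unexpected_repl(sample))  # → [(3, '/user/b.txt')]
```

[Files flagged this way can then be fixed with the `hadoop dfs -setrep` workaround mentioned elsewhere in this thread.]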
>> >>>
>> >>> On Fri, Jun 21, 2013 at 3:13 PM, sam liu <samliuhadoop@gmail.com>
>> >>> wrote:
>> >>> > Hi George,
>> >>> >
>> >>> > Actually, in my hdfs-site.xml, I always set 'dfs.replication' to 2,
>> >>> > but I still encounter this issue.
>> >>> >
>> >>> > Thanks!
>> >>> >
>> >>> >
>> >>> > 2013/6/21 George Kousiouris <gkousiou@mail.ntua.gr>
>> >>> >>
>> >>> >>
>> >>> >> Hi,
>> >>> >>
>> >>> >> I think I have faced this before. The problem is that you have the
>> >>> >> rep factor=3, so the decommission seems to hang because it needs 3
>> >>> >> nodes to achieve the factor (replicas are not created on the same
>> >>> >> node). If you set the replication factor=2, I think you will not
>> >>> >> have this issue. So in general you must make sure that the rep
>> >>> >> factor is <= the number of available datanodes.
>> >>> >>
>> >>> >> BR,
>> >>> >> George
>> >>> >>
>> >>> >>
>> >>> >> On 6/21/2013 12:29 PM, sam liu wrote:
>> >>> >>
>> >>> >> Hi,
>> >>> >>
>> >>> >> I encountered an issue which hangs the decommission operation. The
>> >>> >> steps:
>> >>> >> 1. Install a Hadoop 1.1.1 cluster with 2 datanodes, dn1 and dn2,
>> >>> >> and set 'dfs.replication' to 2 in hdfs-site.xml
>> >>> >> 2. Add node dn3 into the cluster as a new datanode, without
>> >>> >> changing the 'dfs.replication' value in hdfs-site.xml (it stays
>> >>> >> at 2)
>> >>> >> note: step 2 passed
>> >>> >> 3. Decommission dn3 from the cluster
>> >>> >>
>> >>> >> Expected result: dn3 is decommissioned successfully
>> >>> >>
>> >>> >> Actual result: the decommission progress hangs and the status is
>> >>> >> always 'Waiting DataNode status: Decommissioned'
>> >>> >>
>> >>> >> However, if the initial cluster includes >= 3 datanodes, this
>> >>> >> issue is not encountered when adding/removing another datanode.
>> >>> >>
>> >>> >> Also, after step 2, I noticed that some blocks' expected replica
>> >>> >> count is 3, but the 'dfs.replication' value in hdfs-site.xml is
>> >>> >> always 2!
>> >>> >>
>> >>> >> Could anyone please help triage this?
>> >>> >>
>> >>> >> Thanks in advance!
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> --
>> >>> >> ---------------------------
>> >>> >>
>> >>> >> George Kousiouris, PhD
>> >>> >> Electrical and Computer Engineer
>> >>> >> Division of Communications,
>> >>> >> Electronics and Information Engineering
>> >>> >> School of Electrical and Computer Engineering
>> >>> >> Tel: +30 210 772 2546
>> >>> >> Mobile: +30 6939354121
>> >>> >> Fax: +30 210 772 2569
>> >>> >> Email: gkousiou@mail.ntua.gr
>> >>> >> Site: http://users.ntua.gr/gkousiou/
>> >>> >>
>> >>> >> National Technical University of Athens
>> >>> >> 9 Heroon Polytechniou str., 157 73 Zografou, Athens, Greece
>> >>> >
>> >>> >
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Harsh J
>> >>
>> >>
>> >
>>
>>
>>
>> --
>> Harsh J
>
>



-- 
Harsh J
