hadoop-common-user mailing list archives

From sam liu <samliuhad...@gmail.com>
Subject Re: Hang when add/remove a datanode into/from a 2 datanode cluster
Date Thu, 01 Aug 2013 02:37:02 GMT
But please note that the value of 'dfs.replication' for the cluster is
always 2, even when the datanode count is 3. And I am pretty sure I did
not manually create any files with rep=3. So why were some HDFS files
created with repl=3 rather than repl=2?
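A quick way to confirm the replication factor an individual file actually
carries (a minimal sketch, assuming the standard Hadoop 1.x 'hadoop fs'
CLI; the path below is just a placeholder):

  # prints the file's replication factor followed by its name
  hadoop fs -stat "%r %n" /user/sam/part-00000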


2013/8/1 Harsh J <harsh@cloudera.com>

> Step (a) points to both your problem and your solution. You have files
> being created with repl=3 on a 2-DN cluster, which will prevent
> decommission. This is not a bug.
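>
> A minimal sketch of the fix, assuming the standard Hadoop 1.x CLI (this
> simply applies the workaround from your step (a) to the whole
> filesystem):
>
>   # lower every file's replication factor to 2 so the blocks on the
>   # decommissioning node can be re-replicated on the remaining DNs
>   hadoop dfs -setrep -R 2 /
>
>   # then watch the decommission state of the datanodes
>   hadoop dfsadmin -report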
>
> On Wed, Jul 31, 2013 at 12:09 PM, sam liu <samliuhadoop@gmail.com> wrote:
> > I opened a jira for tracking this issue:
> > https://issues.apache.org/jira/browse/HDFS-5046
> >
> >
> > 2013/7/2 sam liu <samliuhadoop@gmail.com>
> >>
> >> Yes, the default replication factor is 3. However, in my case it's
> >> strange: while the decommission was hanging, I found that some blocks'
> >> expected replica count is 3, even though the 'dfs.replication' value in
> >> hdfs-site.xml on every cluster node has been 2 since the cluster was
> >> set up. Below are my steps:
> >>
> >> 1. Install a Hadoop 1.1.1 cluster with 2 datanodes, dn1 and dn2, and
> >> set 'dfs.replication' to 2 in hdfs-site.xml.
> >> 2. Add node dn3 into the cluster as a new datanode, without changing
> >> the 'dfs.replication' value in hdfs-site.xml (it stays 2).
> >> note: step 2 passed
> >> 3. Decommission dn3 from the cluster
> >> Expected result: dn3 is decommissioned successfully
> >> Actual result:
> >> a). The decommission progress hangs and the status always remains
> >> 'Waiting DataNode status: Decommissioned'. But if I execute
> >> 'hadoop dfs -setrep -R 2 /', the decommission continues and eventually
> >> completes.
> >> b). However, if the initial cluster includes >= 3 datanodes, this
> >> issue is not encountered when adding/removing another datanode. For
> >> example, if I set up a cluster with 3 datanodes, I can successfully
> >> add a 4th datanode to it, and then also successfully remove the 4th
> >> datanode from the cluster.
> >>
> >> I suspect it's a bug and plan to open a JIRA against Hadoop HDFS for
> >> this. Any comments?
> >>
> >> Thanks!
> >>
> >>
> >> 2013/6/21 Harsh J <harsh@cloudera.com>
> >>>
> >>> dfs.replication is a per-file parameter. If you have a client that
> >>> does not use the supplied configs, then its default replication is 3,
> >>> and all files it creates (as part of the app or via a job config)
> >>> will have replication factor 3.
> >>>
> >>> You can do an -lsr to find all files and filter which ones have been
> >>> created with a factor of 3 (versus the expected config of 2).
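> >>>
> >>> For example (a minimal sketch, assuming the Hadoop 1.x shell; the awk
> >>> filter simply keys on the replication column of the listing):
> >>>
> >>>   # list everything under / and keep only files whose replication
> >>>   # factor (second column) is 3
> >>>   hadoop fs -lsr / | awk '$1 !~ /^d/ && $2 == 3 {print $NF}'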
> >>>
> >>> On Fri, Jun 21, 2013 at 3:13 PM, sam liu <samliuhadoop@gmail.com> wrote:
> >>> > Hi George,
> >>> >
> >>> > Actually, in my hdfs-site.xml, I always set 'dfs.replication' to 2,
> >>> > but I still encounter this issue.
> >>> >
> >>> > Thanks!
> >>> >
> >>> >
> >>> > 2013/6/21 George Kousiouris <gkousiou@mail.ntua.gr>
> >>> >>
> >>> >>
> >>> >> Hi,
> >>> >>
> >>> >> I think I have faced this before. The problem is that you have the
> >>> >> rep factor = 3, so it seems to hang because it needs 3 nodes to
> >>> >> achieve that factor (replicas are not created on the same node). If
> >>> >> you set the replication factor to 2, I think you will not have this
> >>> >> issue. So in general you must make sure that the rep factor is <=
> >>> >> the number of available datanodes.
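> >>> >>
> >>> >> A quick way to spot this situation (a minimal sketch, assuming the
> >>> >> Hadoop 1.x CLI): fsck flags blocks whose target replication cannot
> >>> >> be met by the live datanodes.
> >>> >>
> >>> >>   # summary includes the default replication factor plus counts of
> >>> >>   # under-replicated and mis-replicated blocks
> >>> >>   hadoop fsck /
> >>> >>
> >>> >>   # number of live datanodes the factor has to fit within
> >>> >>   hadoop dfsadmin -report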
> >>> >>
> >>> >> BR,
> >>> >> George
> >>> >>
> >>> >>
> >>> >> On 6/21/2013 12:29 PM, sam liu wrote:
> >>> >>
> >>> >> Hi,
> >>> >>
> >>> >> I encountered an issue which hangs the decommission operation.
> >>> >> Steps:
> >>> >> 1. Install a Hadoop 1.1.1 cluster with 2 datanodes, dn1 and dn2,
> >>> >> and set 'dfs.replication' to 2 in hdfs-site.xml.
> >>> >> 2. Add node dn3 into the cluster as a new datanode, without
> >>> >> changing the 'dfs.replication' value in hdfs-site.xml (it stays 2).
> >>> >> note: step 2 passed
> >>> >> 3. Decommission dn3 from the cluster
> >>> >>
> >>> >> Expected result: dn3 is decommissioned successfully
> >>> >>
> >>> >> Actual result: the decommission progress hangs and the status
> >>> >> always remains 'Waiting DataNode status: Decommissioned'
> >>> >>
> >>> >> However, if the initial cluster includes >= 3 datanodes, this issue
> >>> >> is not encountered when adding/removing another datanode.
> >>> >>
> >>> >> Also, after step 2, I noticed that some blocks' expected replica
> >>> >> count is 3, but the 'dfs.replication' value in hdfs-site.xml is
> >>> >> always 2!
> >>> >>
> >>> >> Could anyone please help provide some triage?
> >>> >>
> >>> >> Thanks in advance!
> >>> >>
> >>> >>
> >>> >>
> >>> >> --
> >>> >> ---------------------------
> >>> >>
> >>> >> George Kousiouris, PhD
> >>> >> Electrical and Computer Engineer
> >>> >> Division of Communications,
> >>> >> Electronics and Information Engineering
> >>> >> School of Electrical and Computer Engineering
> >>> >> Tel: +30 210 772 2546
> >>> >> Mobile: +30 6939354121
> >>> >> Fax: +30 210 772 2569
> >>> >> Email: gkousiou@mail.ntua.gr
> >>> >> Site: http://users.ntua.gr/gkousiou/
> >>> >>
> >>> >> National Technical University of Athens
> >>> >> 9 Heroon Polytechniou str., 157 73 Zografou, Athens, Greece
> >>> >
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> Harsh J
> >>
> >>
> >
>
>
>
> --
> Harsh J
>
