hadoop-common-user mailing list archives

From modemide <modem...@gmail.com>
Subject Re: lost data with 1 failed datanode and replication factor 3 in 6 node cluster
Date Fri, 21 Oct 2011 12:04:37 GMT
Hi Ossi,

I'm not sure how experienced you are with Hadoop.  I'm still
learning myself, but here's my guess as to what happened.  I
apologize in advance if this is below your current knowledge of
Hadoop.

There are a couple of pieces that I know of which determine file
replication in your situation.  One is the replication factor you set
manually; the other is the configuration on the client from which
you uploaded the data.

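Assuming your namenode didn't complain about missing blocks on its web
control panel, your client may have been configured with a replication
factor of 1.  The replication factor is recorded per file when the file
is created, using the uploading client's setting, so in that case every
file uploaded from that client to HDFS would have a replication factor
of one.  Your setrep -R commands only changed files that already existed.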

A couple of ways to confirm or disprove this theory:
1) Go to the namenode control panel (http://<NAMENODE>:50070 by default).
    Browse the file system.
    Navigate to a file that was created after setting the replication
    factor on the cluster.
    The 4th column from the left is a field called replication.  That
    should tell you the replication factor of any particular file.

2) On the client that you use to upload files to HDFS, check the
    dfs.replication property in your hdfs-site.xml.
    It should be located at $HADOOP_HOME/conf/hdfs-site.xml
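
Both checks could also be scripted from a shell.  What follows is only a
rough sketch: the function names client_replication and under_replicated
are made up for this example, and it assumes that hdfs-site.xml uses the
usual <name>/<value> property layout and that the second column of
"hadoop fs -ls" output is the replication factor ("-" for directories).
The HDFS path in the usage comment is a placeholder.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Hadoop.

# Print the dfs.replication value set in the given hdfs-site.xml.
# Prints nothing if the property is not set in that file.
client_replication() {
  awk '
    /<name>dfs\.replication<\/name>/ { found = 1 }
    found && /<value>/ {
      gsub(/.*<value>|<\/value>.*/, "")
      print
      exit
    }
  ' "$1"
}

# Read "hadoop fs -ls" / "-lsr" style lines on stdin and print
# "<replication> <path>" for every file below the target factor.
# Assumes column 2 is the replication factor and paths contain no spaces.
under_replicated() {
  awk -v target="$1" '
    $1 ~ /^-/ && $2 ~ /^[0-9]+$/ && ($2 + 0) < (target + 0) { print $2, $NF }
  '
}

# Against a live cluster (the HDFS path is a placeholder):
#   client_replication "$HADOOP_HOME/conf/hdfs-site.xml"
#   hadoop fs -lsr /path/to/data | under_replicated 3
```

If the first command prints 1, or the second lists files with a factor
of 1, new uploads from that client are getting a single replica.  If the
first prints nothing, the property isn't set in that file and the
built-in default (3) should apply, unless another config file on the
client overrides it.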


Hope that helps!


Tim






On 10/21/11, Ossi <lossil@gmail.com> wrote:
> hi,
>
> We managed to lose data when 1 datanode broke down in a cluster of 6
> datanodes with replication factor 3.
>
> As far as I know, that shouldn't happen, since each block should have 1
> copy on each of 3 different hosts. So, losing even 2 nodes should be fine.
>
> Earlier we did some tests with replication factor 2, but reverted from that:
>    88  2011-10-12 06:46:49 hadoop dfs -setrep -w 2 -R /
>   148  2011-10-12 10:22:09 hadoop dfs -setrep -w 3 -R /
>
> The lost data was generated after the replication factor was set back to 3.
> And even if the replication factor had been 2, data shouldn't have been
> lost, right?
>
> We wonder how that is possible and in what situations it could happen.
>
>
> br, Ossi
>
