hadoop-mapreduce-user mailing list archives

From Tapas Sarangi <tapas.sara...@gmail.com>
Subject Re: Why do some blocks refuse to replicate...?
Date Thu, 28 Mar 2013 22:07:09 GMT
Did you check whether any disks are mounted read-only on the nodes that have the missing
blocks? If you know which blocks they are, you can manually copy the block files and the
corresponding '.meta' files to another node. Hadoop will re-read those blocks and replicate them.
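The manual copy described above might be sketched like this. The data directory, block ID, and target host are illustrative assumptions, not values from this thread:

```shell
# Sketch only: DN_DATA_DIR and BLOCK_ID are assumptions; adjust for your
# cluster (DN_DATA_DIR corresponds to dfs.datanode.data.dir in hdfs-site.xml).
DN_DATA_DIR="${DN_DATA_DIR:-/data/1/dfs/dn}"
BLOCK_ID="${BLOCK_ID:-blk_1073741825}"

# Locate the block file and its '.meta' checksum file on the old datanode.
if [ -d "$DN_DATA_DIR" ]; then
    find "$DN_DATA_DIR" -name "${BLOCK_ID}*"
fi

# Then copy both files into the equivalent subdirectory on a healthy
# datanode and restart that datanode so it reports the block to the
# namenode, e.g. (hypothetical host and elided path):
#   scp .../blk_* newnode:/data/1/dfs/dn/current/.../
```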


On Mar 28, 2013, at 4:23 PM, Felix GV <felix@mate1inc.com> wrote:

> Yes, I didn't specify how I was testing my changes, but basically, here's what I did:
> My hdfs-site.xml file was modified to include a reference to a file containing a list
of all datanodes (via dfs.hosts) and a reference to a file containing decommissioned nodes
(via dfs.hosts.exclude). After that, I only changed these files, not hdfs-site.xml.
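For reference, the hdfs-site.xml arrangement described above might look like this (dfs.hosts and dfs.hosts.exclude are the standard property names; the file paths are illustrative):

```xml
<!-- Illustrative paths; adjust to your installation. -->
<property>
  <name>dfs.hosts</name>
  <value>/etc/hadoop/conf/dfs.hosts</value>
</property>
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/dfs.hosts.exclude</value>
</property>
```

Edits to the two host files then take effect via `hdfs dfsadmin -refreshNodes`, without touching hdfs-site.xml again.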
> I first added all my old nodes in the dfs.hosts.exclude file, did hdfs dfsadmin -refreshNodes,
and most of the data replicated correctly.
> I then tried removing all old nodes from the dfs.hosts file, did hdfs dfsadmin -refreshNodes,
and I saw that I now had a couple of corrupt and missing blocks (60 of them).
> I re-added all the old nodes in the dfs.hosts file, and removed them gradually, each
time doing the refreshNodes or restarting the NN, and I narrowed it down to three datanodes
in particular, which seem to be the three nodes where all of those 60 blocks are located.
> Is it possible, perhaps, that these three nodes are completely incapable of replicating
what they have (because they're corrupt or something), and so every block was replicated from
other nodes, but the blocks that happened to be located on these three nodes are... doomed?
I can see the data in those blocks in the NN hdfs browser, so I guess it's not corrupted...
I also tried pinging the new nodes from those old ones and it works too, so I guess there
is no network partition...
> I'm in the process of increasing replication factor above 3, but I don't know if that's
gonna do anything...
> --
> Felix
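The replication-factor bump Felix mentions can be done per-path with `hdfs dfs -setrep`. A hedged sketch, where the path comes from the fsck output quoted later in this thread and 5 is just an example target; DRY_RUN=1 (the default here) prints the commands instead of running them against a cluster:

```shell
# Sketch, assuming a working HDFS client on the PATH.
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "$@"; else "$@"; fi; }

# Temporarily raise the replication factor (-w waits for completion),
# then set it back to the original value of 3.
run hdfs dfs -setrep -w 5 /user/hive/warehouse/ads_destinations_hosts
run hdfs dfs -setrep 3 /user/hive/warehouse/ads_destinations_hosts
```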
> On Thu, Mar 28, 2013 at 4:45 PM, MARCOS MEDRADO RUBINELLI <marcosm@buscapecompany.com> wrote:
> Felix,
> After changing hdfs-site.xml, did you run "hadoop dfsadmin -refreshNodes"? That should
have been enough, but you can also try increasing the replication factor of these files, waiting
for them to be replicated to the new nodes, and then setting it back to its original value.
> Cheers,
> Marcos
> On 28-03-2013 17:00, Felix GV wrote:
>> Hello,
>> I've been running a virtualized CDH 4.2 cluster. I now want to migrate all my data
to another (this time physical) set of slaves and then stop using the virtualized slaves.
>> I added the new physical slaves in the cluster, and marked all the old virtualized
slaves as decommissioned using the dfs.hosts.exclude setting in hdfs-site.xml.
>> Almost all of the data replicated successfully to the new slaves, but when I bring
down the old slaves, some blocks start showing up as missing or corrupt (according to the
NN UI as well as fsck*). If I restart the old slaves, then there are no missing blocks reported
by fsck.
>> I've tried shutting down the old slaves two by two, and for some of them I saw no
problem, but then at some point I found two slaves which, when shut down, resulted in a couple
of blocks being under-replicated (1 out of 3 replicas found). For example, fsck would report
stuff like this:
>> /user/hive/warehouse/ads_destinations_hosts/part-m-00012:  Under replicated BP-1207449144-
Target Replicas is 3 but found 1 replica(s).
>> The system then stayed in that state apparently forever. It never actually fixed
the fact that some blocks were under-replicated. Does that mean there's something wrong with
some of the old datanodes...? Why do they keep blocks to themselves (even though they're decommissioned)
instead of replicating those blocks to the new (non-decommissioned) datanodes?
>> How do I force replication of under-replicated blocks?
>> *Actually, the NN UI and fsck report slightly different things. The NN UI always
seems to report 60 under-replicated blocks, whereas fsck only reports those 60 under-replicated
blocks when I shut down some of the old datanodes... When the old nodes are up, fsck reports
0 under-replicated blocks... This is very confusing!
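To cross-check what the NN UI and fsck each report, the usual diagnostics look like this. A sketch assuming an HDFS client on the PATH; DRY_RUN=1 (the default here) only prints the commands:

```shell
DRY_RUN="${DRY_RUN:-1}"
run() { if [ "$DRY_RUN" = "1" ]; then echo "$@"; else "$@"; fi; }

# Per-file block report with replica locations:
run hdfs fsck / -files -blocks -locations
# Just the corrupt/missing block list:
run hdfs fsck / -list-corruptfileblocks
# Cluster-wide view of live, dead, and decommissioning datanodes:
run hdfs dfsadmin -report
```

Running fsck against the specific warehouse path with `-blocks -locations` shows exactly which datanodes hold each remaining replica, which helps confirm whether the 60 blocks really live only on the three suspect nodes.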
>> Any help would be appreciated! Please don't hesitate to ask if I should provide some
of my logs, settings, or the output of some commands...!
>> Thanks :) !
>> --
>> Felix
