hadoop-common-user mailing list archives

From phil young <phil.wills.yo...@gmail.com>
Subject Fwd: Namenode corruption: need help quickly please
Date Tue, 26 Oct 2010 00:58:45 GMT
---------- Forwarded message ----------
From: phil young <phil.wills.young@gmail.com>
Date: Mon, Oct 25, 2010 at 8:30 PM
Subject: Re: Namenode corruption: need help quickly please
To: common-user@hadoop.apache.org


In the interest of helping others, here are some details on what happened to
us and how we recovered.


Incompatible Build Versions (between the NameNode and the DataNodes)

We and others have seen the following error. Apparently it occurs when
some change results in a difference in the "build" versions. This is not
DFS corruption, but it may appear to be, because the master and task
tracker processes start fine while the DataNodes report the following
error:

2010-10-25 18:35:38,470 FATAL
org.apache.hadoop.hdfs.server.datanode.DataNode: Incompatible build
versions: namenode BV = ; datanode BV = xxxxx

This was caused by running "ant package" on the master.
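Before restarting anything, the mismatch can be confirmed by comparing the
output of `hadoop version` on the master against each slave. A minimal
sketch; the helper name `bv_matches` and the hostnames in the usage comment
are illustrative, not part of Hadoop:

```shell
# Hypothetical helper: decide whether two nodes' `hadoop version` outputs
# agree. The DataNode startup check compares build versions (revision,
# builder, build date), so any textual difference here is a red flag.
bv_matches() {
    [ "$1" = "$2" ]
}

# Sketch of use across the cluster (hostnames and ssh access assumed):
#   master_bv=$(hadoop version)
#   slave_bv=$(ssh slave1 hadoop version)
#   bv_matches "$master_bv" "$slave_bv" || echo "build mismatch on slave1"
```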

To recover, we restored /hadoop on the master using the following steps:

   1. Stop the cluster (somewhat violently)
      1. Normal shutdown
         1. stop-all.sh
      2. Find and kill lingering processes
         1. mon_jps #an alias in ~/.bash_profile that runs jps on all slaves
         2. kill -9 each running Java process
      3. Remove pid files
         1. ls -ltr /tmp/*pid
         2. rm -f /tmp/*pid #on each slave
   2. Restore /hadoop on the master from a slave
      1. cd /usr/local/hadoop
      2. mv hadoop-0.20.2 hadoop-0.20.2.MOVED
      3. restore hadoop-0.20.2 from a tarball generated on a slave
   3. Restore the original "conf" folder for the master (since it's not the
      same as the slaves')
      1. cd hadoop-0.20.2
      2. mv ./conf ./conf.MOVED
      3. cp -r ../hadoop-0.20.2.MOVED/conf ./
   4. Start the cluster
      1. start-all.sh
      2. test_hadoop #an alias in ~/.bash_profile that runs a test
         map-reduce job
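The steps above can be sketched as one shell function. This is a sketch
under assumptions, not an exact transcript: the function name
`recover_hadoop`, its arguments, and the optional pid-directory parameter
are illustrative; killing straggler Java processes on the slaves is still a
manual step.

```shell
# Sketch of the recovery procedure: stop the cluster, clear stale pid
# files, swap the suspect install for a known-good copy from a slave,
# and restore the master's own conf directory.
recover_hadoop() (
    base=$1            # install dir, e.g. /usr/local/hadoop (assumed layout)
    tarball=$2         # known-good hadoop-0.20.2 tarball taken from a slave
    piddir=${3:-/tmp}  # where the *pid files live (illustrative parameter)

    # 1. Stop the cluster and remove stale pid files
    #    (lingering Java processes still need a manual jps + kill -9)
    stop-all.sh || true
    rm -f "$piddir"/*pid

    # 2. Set the suspect install aside and unpack the slave's copy
    cd "$base" || exit 1
    mv hadoop-0.20.2 hadoop-0.20.2.MOVED
    tar xzf "$tarball"

    # 3. Put the master's own conf back (it differs from the slaves')
    cd hadoop-0.20.2 || exit 1
    mv conf conf.MOVED
    cp -r ../hadoop-0.20.2.MOVED/conf .

    # 4. Bring the cluster back up
    start-all.sh
)
```

The body runs in a subshell (note the parentheses) so the `cd` calls don't
change the caller's working directory.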
On Mon, Oct 25, 2010 at 8:00 PM, Brian Bockelman <bbockelm@cse.unl.edu> wrote:

>
> On Oct 25, 2010, at 6:35 PM, phil young wrote:
>
> > I had also assumed that some other jar or configuration file had been
> > changed, but reviewing the timestamps on the files did not reveal the
> > problem.
> > On the assumption that something did in fact change, that I was not
> > seeing, I renamed my $HADOOP_HOME directory and replaced it with one
> > from a slave.
> > I then restored $HADOOP_HOME/conf from the original/renamed directory,
> > and voila - we're back in business.
> >
>
> Glad to hear this.
>
> > Brian, thanks very much for your help.
> > It took literally more time for me to write the original email (5
> > minutes) than to get a reply which indicated a way to solve the
> > problem, and another 5 minutes to solve it.
> > That says a lot about the user group. I don't think I would have
> > reached a human being in 5 minutes for the tech support for most
> > products.
> > I'll make sure to monitor this list more closely so I can pay it
> > forward ;)
> >
>
> No problem.  There are lots of good people on this list, and I certainly
> have done the "oh crap, I put my neck on the line for this new Hadoop thing
> and now its broke" email.
>
> Brian
