hadoop-common-user mailing list archives

From phil young <phil.wills.yo...@gmail.com>
Subject Fwd: Namenode corruption: need help quickly please
Date Tue, 26 Oct 2010 00:58:45 GMT
---------- Forwarded message ----------
From: phil young <phil.wills.young@gmail.com>
Date: Mon, Oct 25, 2010 at 8:30 PM
Subject: Re: Namenode corruption: need help quickly please
To: common-user@hadoop.apache.org

In the interests of helping others, here are some details on what happened to
us and how we recovered...

Incompatible Build Versions (between the NameNode and DataNodes)

We and others have seen the following error. Apparently it occurs when
there's some change resulting in a difference in the "build" versions. This
is not DFS corruption, but it may appear to be, because the master and
TaskTracker processes start fine while the DataNodes fail with the
following error:

2010-10-25 18:35:38,470 FATAL
org.apache.hadoop.hdfs.server.datanode.DataNode: Incompatible build
versions: namenode BV = ; datanode BV = xxxxx

In our case this was caused by running "ant package" on the master, which
left the master with a different build version than the slaves.
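
To confirm the mismatch before restoring anything, you can compare the
output of "hadoop version" across nodes. A minimal sketch, assuming
passwordless ssh to the slaves and the same install path on every node
(the path and the conf/slaves file location below are our assumptions,
not from the original post):

    #!/bin/bash
    # Compare each slave's build version against the master's.
    HADOOP_HOME=${HADOOP_HOME:-/usr/local/hadoop/hadoop-0.20.2}

    master_bv=$("$HADOOP_HOME/bin/hadoop" version)
    echo "master:"; echo "$master_bv"

    while read -r slave; do
      slave_bv=$(ssh "$slave" "$HADOOP_HOME/bin/hadoop version")
      if [ "$slave_bv" != "$master_bv" ]; then
        echo "MISMATCH on $slave:"; echo "$slave_bv"
      fi
    done < "$HADOOP_HOME/conf/slaves"

Any node whose Subversion revision or "Compiled by" line differs from the
master's is a candidate for the problem above.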

To recover, we restored /hadoop on the master using the following steps (a
consolidated sketch of the same procedure follows the list):

   1. Stop the cluster (somewhat violently)
      1. Normal shutdown
         1. stop-all.sh
      2. Find and kill lingering processes
         1. mon_jps  #an alias in ~/.bash_profile that runs jps on all slaves
         2. kill -9 each running Java process
      3. Remove pid files
         1. ls -ltr /tmp/*pid
         2. rm -f /tmp/*pid  #on each slave
   2. Restore /hadoop on the master from a slave
      1. cd /usr/local/hadoop
      2. mv hadoop-0.20.2 hadoop-0.20.2.MOVED
      3. Restore hadoop-0.20.2 from a tarball generated on a slave
   3. Restore the original "conf" folder for the master (since it's not the
      same as the slaves')
      1. cd hadoop-0.20.2
      2. mv ./conf ./conf.MOVED
      3. cp -r ../hadoop-0.20.2.MOVED/conf ./
   4. Start the cluster
      1. start-all.sh
      2. test_hadoop  #an alias in ~/.bash_profile that runs a test
         map-reduce job
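
For reference, here is the whole procedure as one script. This is a sketch
of what we did, not a drop-in tool: it assumes passwordless ssh, that the
slaves file lives at conf/slaves, and that SLAVE (a hypothetical hostname)
names a healthy slave to copy the tree from.

    #!/bin/bash
    set -e
    HADOOP_ROOT=/usr/local/hadoop
    DIST=hadoop-0.20.2
    SLAVE=slave1   # hypothetical: any known-good slave

    # 1. Stop the cluster, then kill stragglers and remove stale pid files
    "$HADOOP_ROOT/$DIST/bin/stop-all.sh"
    while read -r host; do
      ssh "$host" "jps | grep -vw Jps | awk '{print \$1}' | xargs -r kill -9;
                   rm -f /tmp/*pid"
    done < "$HADOOP_ROOT/$DIST/conf/slaves"

    # 2. Restore /hadoop on the master from the healthy slave
    cd "$HADOOP_ROOT"
    mv "$DIST" "$DIST.MOVED"
    ssh "$SLAVE" "tar -C $HADOOP_ROOT -czf - $DIST" | tar -xzf -

    # 3. Put the master's original conf back (it differs from the slaves')
    cd "$DIST"
    mv ./conf ./conf.MOVED
    cp -r "../$DIST.MOVED/conf" ./

    # 4. Start the cluster
    bin/start-all.sh

After start-all.sh, run a small test map-reduce job (we use the test_hadoop
alias) and check the DataNode logs for the "Incompatible build versions"
line before declaring victory.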

On Mon, Oct 25, 2010 at 8:00 PM, Brian Bockelman <bbockelm@cse.unl.edu> wrote:

> On Oct 25, 2010, at 6:35 PM, phil young wrote:
> > I had also assumed that some other jar or configuration file had been
> > changed, but reviewing the timestamps on the files did not reveal the
> > problem.
> > On the assumption that something did in fact change that I was not
> > seeing, I renamed my $HADOOP_HOME directory and replaced it with one
> > from a slave. I then restored $HADOOP_HOME/conf from the
> > original/renamed directory, and voila - we're back in business.
> >
> Glad to hear this.
> > Brian, thanks very much for your help.
> > It took literally more time for me to write the original email (5
> > minutes) than to get a reply which indicated a way to solve the
> > problem, and another 5 minutes to solve it.
> > That says a lot about the user group. I don't think I would have
> > reached a human being in 5 minutes for the tech support for most
> > products.
> > I'll make sure to monitor this list more closely so I can pay it
> > forward ;)
> >
> No problem.  There are lots of good people on this list, and I certainly
> have done the "oh crap, I put my neck on the line for this new Hadoop
> thing and now it's broke" email.
>
> Brian
