hadoop-common-user mailing list archives

From Jagadeesh <jagadeesh...@gmail.com>
Subject RE: Urgent: Production Issues
Date Thu, 21 Dec 2006 07:56:42 GMT
Hi All,

Over the past day we managed to migrate our clusters from 0.7.2 to
0.9.0. We had no hitches migrating our development cluster, and we have just
finished upgrading our live cluster. Along the way we picked up a few tips
for anybody else taking the same path:

* Always take a backup of your edits and fsimage files before migrating.

* Since the live cluster had been running for more than 3 months, we had some
problems shutting it down, something we didn't foresee when we tried it on
the much smaller development cluster (which is restarted frequently).
However, I managed to do it by killing all the processes on the master
node as well as on the cluster nodes. I believe there is no harm in doing
that, as the application interfacing with HDFS was shut down before the
processes were killed.

* Please note that the Namenode was in Safe mode for longer than I
expected; I believe it was trying to re-index the files.

* With this new release there were some initial hiccups with respect to
the number of simultaneous connections allowed; we solved this by
introducing an object pool in our application.
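
To illustrate the object pool idea from the last tip, here is a minimal
sketch in Python. It is not the code from our application: the class name
`ConnectionPool`, the `factory` callable and the pool size are assumptions
made for the example. The point is only that a fixed-size pool puts a hard
cap on how many connections are in use at the same time.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool that caps the number of connections in use at once.

    `factory` is a hypothetical callable that opens one connection to the
    file system; it stands in for whatever client object the application uses.
    """

    def __init__(self, factory, size):
        self._idle = queue.Queue(maxsize=size)
        # Create every connection up front so the cap can never be exceeded.
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def borrow(self, timeout=None):
        # Blocks until a connection is free; raises queue.Empty on timeout.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the caller failed.
            self._idle.put(conn)
```

A caller would wrap each upload or download in `with pool.borrow() as conn:`
so that a request waits for a free connection instead of opening a new one.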

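For the first tip (backing up edits and fsimage before migrating), a copy
into a timestamped directory is enough. The sketch below is only an
illustration: the function name and both directory arguments are
assumptions, and it searches the name directory recursively because I am
not assuming a particular on-disk layout.

```python
import shutil
import time
from pathlib import Path


def backup_namenode_metadata(name_dir, backup_root, names=("edits", "fsimage")):
    """Copy the namenode metadata files into a new timestamped directory.

    `name_dir` and `backup_root` are placeholders for your own paths.
    """
    name_dir = Path(name_dir)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"namenode-backup-{stamp}"
    dest.mkdir(parents=True, exist_ok=False)
    copied = []
    # Walk the whole name directory and keep the relative layout in the copy.
    for path in sorted(name_dir.rglob("*")):
        if path.is_file() and path.name in names:
            target = dest / path.relative_to(name_dir)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)  # copy2 also preserves timestamps
            copied.append(target)
    if not copied:
        raise FileNotFoundError(f"no metadata files found under {name_dir}")
    return copied
```

Run it with the namenode stopped, so the files cannot change during the copy.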

Also, as a level of redundancy for the primary index, we are now mirroring
it using SuSE Linux clustering (a commercial product that is part of SuSE
Linux Enterprise Server 10.2). This is the best way we have found to
introduce further redundancy for the index. Previously we had tried to
solve this problem by syncing and using heartbeat, and we also had our own
solution which would sync and attempt to detect a failure by making
frequent requests, but neither worked as well as the SuSE server.


We created virtual environments (User modes) on 3 servers running SuSE, and
the master node runs within them. Any change in those environments is
propagated to the other 2 servers in the SuSE cluster, and if one server
goes down, another takes over and the application can still communicate
using the same hostname / IP address. It is a very fast and stable
solution. We developed the failure detection ourselves so that the
cluster could respond faster, with custom requests and responses to make
sure our application is functioning (e.g. instead of doing a very simple ping
test to check the state of a server in the cluster, we would do a number of
application-level API calls to make sure that the server is not only alive,
but that it responds with the expected result and there is full data
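
The difference between a ping test and an application-level check can be
sketched as below. This is not our actual detection code: the function
name, the probe tuples and the timing threshold are assumptions for the
example. Each probe is a call the application would really make, paired
with the result it is expected to return.

```python
import time


def check_node(probes, max_seconds=2.0):
    """Run application-level probes; the node is healthy only if all pass.

    `probes` is a list of (name, call, expected) tuples, where `call` is a
    hypothetical zero-argument function that performs one API request.
    """
    failures = []
    for name, call, expected in probes:
        started = time.monotonic()
        try:
            result = call()
        except Exception as exc:
            # An alive host that throws errors still counts as failed.
            failures.append(f"{name}: raised {exc!r}")
            continue
        elapsed = time.monotonic() - started
        if result != expected:
            failures.append(f"{name}: unexpected result {result!r}")
        elif elapsed > max_seconds:
            # Measured after the call returns; a hung call is not interrupted.
            failures.append(f"{name}: too slow ({elapsed:.2f}s)")
    return (not failures), failures
```

A monitor could call `check_node` on a schedule and trigger the failover
only after several consecutive failures, to avoid reacting to one slow reply.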


I would like to thank everybody who helped us out with the Hadoop aspects of
our storage cluster. If you are experiencing something similar, feel free
to contact me.

Merry X'mas and Happy New Year!!!




-----Original Message-----
From: Jagadeesh [mailto:jagadeesh] 
Sent: Monday, December 18, 2006 11:30 AM
To: 'hadoop-user@lucene.apache.org'
Subject: Urgent: Production Issues

Hi All,

I am running Hadoop 0.7.2 in a production environment and it has stored
~170GB of data. Please read below the deployment architecture I am using.

I am using 4 nodes with 1.3TB of storage each, and the master node is not
used for storage. So I have 5 servers in total, of which 4 are running
Hadoop nodes. This setup was working fine for the last 20-25 days
and there were no issues. As mentioned earlier, the total storage has now
gone up to ~170GB.

A couple of days ago I noticed an error where Hadoop was not accepting new
files: uploads always failed, but downloads were still working fine. The
exception I was getting was "writing <filename>.crc failed". When I
tried restarting the service, I got the messages "jobtracker not
available" and "tasktracker not available". Then I had to kill all the
processes on the master node as well as on the client nodes to restart the

After that everything worked fine for one more day, and now I keep getting
the message

failure closing block of file /user/root/.LICENSE.txt2233331.crc to node

Even if I restart the service, I get this message after 10 minutes.

I read on the mailing list that this issue is resolved in 0.9.0, but I am a
bit skeptical about moving to 0.9.0 as I don't know whether I will end up
losing the files that are already stored. Kindly confirm this and I will
move to 0.9.0. Please also tell me the steps or precautions I should
take before moving to 0.9.0.

Thanks and Regards
