hbase-user mailing list archives

From Erdem Agaoglu <erdem.agao...@gmail.com>
Subject Node failure causes weird META data?
Date Thu, 28 Oct 2010 07:58:48 GMT
Hi all,

We have a 6-node test cluster on which we are trying to run an
HBase/MapReduce application. To simulate a power failure, we ran kill -9 on
every Hadoop-related process on one of the slave nodes (DataNode,
RegionServer, TaskTracker, ZK quorum peer, and I think the SecondaryNameNode
was on this node too). We expected a smooth failover for all services but did
not get one on the HBase side. While our regions seemed intact (not
confirmed), we lost our table definitions, which points to some kind of META
region failure. As a result, our application failed with several
TableNotFoundExceptions. The simulation was run with no load and a very
small amount of data (about 10 rows in 3 tables).

In our setup, HBase is 0.89.20100924, r1001068, and Hadoop
is 0.20.3-append-r964955-1240, r960957. Most configuration parameters are
left at their defaults.

If we did something wrong up to this point, please ignore the rest of this
message; below I try to explain what we did to reproduce the problem, and it
may be irrelevant.

Say the machines are named A, B, C, D, E, F, where A is the master-like node,
the others are slaves, and the power failure is on F. Since we have little
data, we have one ROOT region and only one META region. I'll try to sum up
the whole scenario.

A: NN, DN, JT, TT, HM, RS
B: DN, TT, RS, ZK
C: DN, TT, RS, ZK
D: DN, TT, RS, ZK
E: DN, TT, RS, ZK
F: SNN, DN, TT, RS, ZK

0. Initial state -> ROOT: F, META: A
1. Power failure on F -> ROOT: C, META: E -> We lost our tables. We waited
about half an hour and nothing came back.
2. Put F back online -> No effect
3. Created a table 'testtable' to see if we would lose it
4. Ran kill -9 on the DataNode on F -> No effect -> Started it again
5. Ran kill -9 on the RegionServer on F -> No effect -> Started it again
6. Ran kill -9 on the RegionServer on E -> ROOT: C, META: A -> We lost
'testtable' but got back our tables from before the simulation. It seemed
that because A had held META before the simulation, the old table
definitions were revived.
7. Restarted the whole cluster -> ROOT: A, META: F -> We lost 2 of our
original 6 tables, and 'testtable' was revived. That small amount of data
seems corrupted too, as our Scans don't finish.
8. Ran to the mailing list.
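
As a reading aid, the ROOT/META locations reported in the steps above can be
written down as data and grouped by the node hosting META. This is only a
sketch in plain Python, not HBase code: the names TIMELINE and
visibility_by_meta_host are made up for illustration, and the "visible"
labels are paraphrases of the steps, not output from any tool. Steps 2 to 5
are left out because no region movement is reported for them.

```python
# Each entry: (step number, event, ROOT host, META host, tables visible).
# The "tables visible" strings paraphrase the steps in this message.
TIMELINE = [
    (0, "initial state", "F", "A", "original"),
    (1, "power failure on F", "C", "E", "none"),
    (6, "kill -9 RegionServer on E", "C", "A", "original"),
    (7, "full cluster restart", "A", "F", "4 of 6 original + testtable"),
]


def visibility_by_meta_host(timeline):
    """Group (step, tables visible) pairs by the node that hosted META."""
    grouped = {}
    for step, _event, _root, meta, visible in timeline:
        grouped.setdefault(meta, []).append((step, visible))
    return grouped


if __name__ == "__main__":
    for host, entries in sorted(visibility_by_meta_host(TIMELINE).items()):
        for step, visible in entries:
            print(f"META on {host} (step {step}): tables visible = {visible}")
```

Grouped this way, both steps with META on A show the original tables, which
matches the observation in step 6.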

First of all, thanks for reading this far. From where we stand now, we are
not even sure whether this is the expected behavior (that is, if a ROOT or
META region dies, we lose data and must run something like hbck), whether we
are missing a configuration setting, or whether this is a bug. Needless to
say, we are relatively new to HBase, so the last possibility is that we have
not understood it at all.

Thanks in advance for any ideas.

-- 
erdem agaoglu
