hbase-user mailing list archives

From Ryan Rawson <ryano...@gmail.com>
Subject Re: Region server going down
Date Sat, 17 Oct 2009 01:12:38 GMT
Hey,

ZooKeeper is a pretty fundamental part of how we make things happen in
HBase: the ZK session is how we synchronize between the master and the
regionserver.  The problem comes when you lose your session.  At that
point neither side knows what the other knows, and the safest thing is
to abort the regionserver.  Without that, we can end up with multiple
region assignments, which is pretty messy.

ZK is like DNS and the network: without it running, we are more or
less in trouble.  There is no effective difference between a crashed
machine and one that is having network problems, so they are treated
the same and recovery is the same.

Having said that, the session timeout is set in HBase, and I think it
ships at 40 seconds or so.  So it should take more than a minor
problem or a few lost packets to induce a crash.  On the other hand,
if you are killing the entire ZK cluster and expecting HBase to
be ok, that is not really what will happen.  This is why ZK is run in
a 2N+1 scenario, so you can do rolling reboots and survive the loss of
N machines.  But ZK is required to be up 24/7; luckily it is fairly
reliable.
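
To make the 2N+1 arithmetic concrete, here is a minimal sketch in
Python.  It is not HBase or ZooKeeper code, and the function names are
made up for illustration.  It assumes the usual majority rule: the
ensemble stays available only while more than half of its servers are
up.

```python
def tolerated_failures(ensemble_size: int) -> int:
    """Number of servers that can be lost while a strict majority remains.

    Hypothetical helper: for an ensemble of 2N+1 servers this returns N.
    """
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return (ensemble_size - 1) // 2


def has_quorum(ensemble_size: int, servers_up: int) -> bool:
    """True while strictly more than half of the ensemble is still running."""
    return servers_up > ensemble_size // 2


if __name__ == "__main__":
    for size in (1, 3, 4, 5):
        print(size, "servers tolerate", tolerated_failures(size), "failure(s)")
```

Under that assumption a 3-server ensemble survives 1 failure and a
5-server ensemble survives 2.  A 4-server ensemble still survives only
1, which is why odd sizes are the norm.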

With HDFS 0.21, at least we'll be able to have effective hlog recovery.

Now, your specific problem looks like a common issue with the master
and regionservers being confused about what type of server they are
running. I don't personally run the indexed or transactional
extensions (they are not as inherently scalable), so maybe someone
else can chime in.

-ryan

On Fri, Oct 16, 2009 at 1:29 PM, Lucas Nazário dos Santos
<nazario.lucas@gmail.com> wrote:
> Hi,
>
> Today one regionserver crashed and I can't figure out why. Everything
> started with the message "server,60020,1255644477834 znode expired". I'm
> still running the cluster on little memory and swap is getting in my way
> from time to time (it's rare but I need to fix it). Can it be the cause of
> the error below? Do you think that five minutes is enough for the property
> zookeeper.session.timeout? Why the message "wrong key class:
> org.apache.hadoop.hbase.regionserver.HLogKey is not class"?
>
> My tests show that whenever zookeeper "shakes" the whole cluster goes down.
> Shouldn't HBase be more robust regarding Zookeeper? Something like a retry
> strategy...
>
> Lucas
>
>
>
> 2009-10-16 15:07:32,167 INFO org.apache.hadoop.hbase.master.ServerManager: 2
> region servers, 0 dead, average load 7.0
> 2009-10-16 15:07:32,537 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.rootScanner scanning meta region {server: 192.168.1.2:60020,
> regionname: -ROOT-,,0, startKey: <>}
> 2009-10-16 15:07:32,560 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.rootScanner scan of 1 row(s) of meta region {server:
> 192.168.1.2:60020, regionname: -ROOT-,,0, startKey: <>} complete
> 2009-10-16 15:07:32,654 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.metaScanner scanning meta region {server: 192.168.1.3:60020,
> regionname: .META.,,1, startKey: <>}
> 2009-10-16 15:07:32,804 INFO org.apache.hadoop.hbase.master.BaseScanner:
> RegionManager.metaScanner scan of 12 row(s) of meta region {server:
> 192.168.1.3:60020, regionname: .META.,,1, startKey: <>} complete
> 2009-10-16 15:07:32,804 INFO org.apache.hadoop.hbase.master.BaseScanner: All
> 1 .META. region(s) scanned
> 2009-10-16 15:08:09,551 INFO org.apache.hadoop.hbase.master.ServerManager:
> server,60020,1255644477834 znode expired
> 2009-10-16 15:08:09,605 INFO org.apache.hadoop.hbase.master.RegionManager:
> -ROOT- region unset (but not set to be reassigned)
> 2009-10-16 15:08:09,605 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation: process shutdown of
> server server,60020,1255644477834: logSplit: false, rootRescanned: false,
> numberOfMetaRegions: 1, onlineMetaRegions.size(): 1
> 2009-10-16 15:08:09,623 INFO org.apache.hadoop.hbase.regionserver.HLog:
> Splitting 20 hlog(s) in
> hdfs://server2:9000/hbase/.logs/server,60020,1255644477834
> 2009-10-16 15:08:09,841 WARN org.apache.hadoop.hbase.regionserver.HLog:
> Exception processing
> hdfs://server2:9000/hbase/.logs/server,60020,1255644477834/hlog.dat.1255644478353
> -- continuing. Possible DATA LOSS!
> java.io.IOException: wrong key class:
> org.apache.hadoop.hbase.regionserver.HLogKey is not class
> org.apache.hadoop.hbase.regionserver.transactional.THLogKey
>        at
> org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1824)
>        at
> org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
>        at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:896)
>        at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:802)
>        at
> org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:274)
>        at
> org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:490)
>        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:425)
> 2009-10-16 15:08:09,870 WARN org.apache.hadoop.hbase.regionserver.HLog:
> Exception processing
> hdfs://server2:9000/hbase/.logs/server,60020,1255644477834/hlog.dat.1255648058463
> -- continuing. Possible DATA LOSS!
> java.io.IOException: wrong key class:
> org.apache.hadoop.hbase.regionserver.HLogKey is not class
> org.apache.hadoop.hbase.regionserver.transactional.THLogKey
>        at
> org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1824)
>        at
> org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
>        at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:896)
>        at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:802)
>        at
> org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:274)
>        at
> org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:490)
>        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:425)
> 2009-10-16 15:08:09,886 WARN org.apache.hadoop.hbase.regionserver.HLog:
> Exception processing hdfs://server2:9000/hbase/.logs/server,60020,12556
>
> // More wrong key class errors...
>
> 2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.regionserver.HLog: hlog
> file splitting completed in 594 millis for
> hdfs://server2:9000/hbase/.logs/server,60020,1255644477834
> 2009-10-16 15:08:10,203 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation: Log split complete,
> meta reassignment and scanning:
> 2009-10-16 15:08:10,203 INFO
> org.apache.hadoop.hbase.master.RegionServerOperation: ProcessServerShutdown
> reassigning ROOT region
> 2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.master.RegionManager:
> -ROOT- region unset (but not set to be reassigned)
> 2009-10-16 15:08:10,203 INFO org.apache.hadoop.hbase.master.RegionManager:
> ROOT inserted into regionsInTransition
> 2009-10-16 15:08:32,167 INFO org.apache.hadoop.hbase.master.ServerManager: 1
> region servers, 1 dead, average load 6.0[server,60020,1255644477834]
>
