hbase-user mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject Re: HBase Exceptions on version 0.20.1
Date Sat, 07 Nov 2009 19:05:44 GMT
You should consider provisioning more nodes to get beyond this ceiling you encountered. 

DFS write latency spikes from 3 seconds, to 6, to 15! Flushing cannot happen fast enough
to avoid an OOME. Possibly there was not even sufficient CPU left to GC. The log entries you
highlighted indicate that the load you are exerting on your current cluster needs to be spread
over more resources than are currently allocated.

This: 
> 2009-11-06 09:15:37,144 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 286007ms,
ten times longer than scheduled: 10000

indicates that a thread which wanted to sleep for 10 seconds was starved of CPU for 286 seconds.
Obviously ZooKeeper timeouts and the resulting HBase process shutdowns, missed DFS heartbeats
possibly leading to spurious declarations of dead datanodes, and other serious problems will
result from this.
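
For reference, the mechanism behind that warning is just a comparison of actual versus
requested sleep time. A minimal sketch of the idea (my own illustration here, not the actual
org.apache.hadoop.hbase.util.Sleeper code) would be:

    // Hedged sketch: warn when a scheduled sleep overruns badly, which
    // usually means the JVM was starved of CPU or stalled in GC.
    public class SleepWatchdog implements Runnable {
      private final long periodMs;   // scheduled sleep, e.g. 10000
      public SleepWatchdog(long periodMs) { this.periodMs = periodMs; }
      public void run() {
        while (!Thread.currentThread().isInterrupted()) {
          long start = System.currentTimeMillis();
          try {
            Thread.sleep(periodMs);
          } catch (InterruptedException e) {
            return;
          }
          long slept = System.currentTimeMillis() - start;
          if (slept > 10 * periodMs) {   // e.g. 286007ms against the scheduled 10000ms
            System.err.println("WARN We slept " + slept
                + "ms, ten times longer than scheduled: " + periodMs);
          }
        }
      }
    }

A large, repeated overrun like that almost always traces back to GC pauses or swap.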

Did your systems start to swap? 

When region servers shut down, the master notices this and splits their HLogs into per-region
reconstruction logs. These are the "oldlogfile.log" files. The master log will shed light
on why this particular reconstruction log was botched; the split would have happened at the
master. The region server probably did do a clean shutdown. I suspect DFS was in extremis due
to overloading, so the split failed. The checksum error indicates an incomplete write at the
OS level. Did a datanode crash?

HBASE-1956 is about making the DFS latency metric exportable via the Hadoop metrics layer,
and from there to Ganglia. Write latency above 1 or 2 seconds is a warning. Anything above
5 seconds is an alarm. Either is a good indication that an overloading condition is in
progress.
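
As a rough illustration of applying those thresholds (a sketch of my own, not the HBASE-1956
patch itself), one could time each write and classify the latency:

    // Hedged sketch: time one write and apply the 1-2 second warning /
    // 5 second alarm thresholds described above.
    import java.io.IOException;
    import java.io.OutputStream;

    public class WriteLatencyCheck {
      static void timedWrite(OutputStream out, byte[] edit) throws IOException {
        long start = System.currentTimeMillis();
        out.write(edit);
        out.flush();
        long latencyMs = System.currentTimeMillis() - start;
        if (latencyMs > 5000) {
          System.err.println("ALARM write took " + latencyMs + "ms; overload in progress");
        } else if (latencyMs > 2000) {
          System.err.println("WARN write took " + latencyMs + "ms; approaching overload");
        }
      }
    }

This is essentially what the "took Nms appending an edit to hlog" warnings in your log are
already telling you.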

The Hadoop stack, being pre-1.0, has some rough edges. Response to overloading is one of them.
For one thing, HBase could be better about applying backpressure to writing clients when the
system is under stress. We will get there. HBASE-1956 is a start.
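
Until then, a writing client can protect itself somewhat by backing off on its own. A hedged
sketch against the 0.20 client API (the retry count and sleep values here are arbitrary, not
something HBase does for you today):

    // Hedged sketch: back off and retry a put instead of hammering an
    // already overloaded region server.
    import java.io.IOException;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;

    public class BackoffWriter {
      public static void putWithBackoff(HTable table, Put put)
          throws IOException, InterruptedException {
        long sleepMs = 1000;
        for (int attempt = 0; attempt < 5; attempt++) {
          try {
            table.put(put);
            return;
          } catch (IOException e) {
            // RetriesExhaustedException, regions moving, DFS slowness, etc.
            Thread.sleep(sleepMs);
            sleepMs *= 2;   // exponential backoff
          }
        }
        throw new IOException("put failed after repeated backoff");
      }
    }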

    - Andy





________________________________
From: elsif <elsif.then@gmail.com>
To: hbase-user@hadoop.apache.org
Sent: Sun, November 8, 2009 1:12:12 AM
Subject: Re: HBase Exceptions on version 0.20.1

My most recent test made it past a couple of days, and then one of the
region servers failed during a compaction:

2009-11-06 09:09:59,284 INFO
org.apache.hadoop.hbase.regionserver.HRegion: Starting compaction on region
fc_test,e9e8bab582b05d18d550d4112258b4c29707c33ebdd45447c20cce5c5a9e95a0,1257440539235
2009-11-06 09:09:59,294 WARN org.apache.hadoop.hbase.regionserver.HLog:
IPC Server handler 6 on 60020 took 4611ms appending an edit to hlog;
editcount=66
2009-11-06 09:10:14,298 WARN org.apache.hadoop.hbase.regionserver.HLog:
IPC Server handler 7 on 60020 took 3414ms appending an edit to hlog;
editcount=93
2009-11-06 09:10:17,729 WARN org.apache.hadoop.hbase.regionserver.HLog:
IPC Server handler 7 on 60020 took 3430ms appending an edit to hlog;
editcount=94
2009-11-06 09:10:17,734 WARN org.apache.hadoop.hbase.regionserver.HLog:
IPC Server handler 5 on 60020 took 6889ms appending an edit to hlog;
editcount=95
2009-11-06 09:10:17,739 WARN org.apache.hadoop.hbase.regionserver.HLog:
IPC Server handler 1 on 60020 took 6894ms appending an edit to hlog;
editcount=96
2009-11-06 09:10:17,743 WARN org.apache.hadoop.hbase.regionserver.HLog:
IPC Server handler 0 on 60020 took 6898ms appending an edit to hlog;
editcount=97
2009-11-06 09:10:17,745 WARN org.apache.hadoop.hbase.regionserver.HLog:
IPC Server handler 3 on 60020 took 6901ms appending an edit to hlog;
editcount=98
2009-11-06 09:10:17,752 WARN org.apache.hadoop.hbase.regionserver.HLog:
IPC Server handler 9 on 60020 took 6908ms appending an edit to hlog;
editcount=99
2009-11-06 09:10:20,047 WARN org.apache.hadoop.hbase.regionserver.HLog:
IPC Server handler 2 on 60020 took 9203ms appending an edit to hlog;
editcount=100
2009-11-06 09:10:20,074 INFO
org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Forced flushing of
fc_test,d96130bef65b02598c22140983e596f78cb3aac9249361054a3ac6ca287eb6ed,1257447874792
because global memstore limit of 2.3g exceeded; currently 2.3g and
flushing till 1.5g
2009-11-06 09:10:20,079 WARN org.apache.hadoop.hbase.regionserver.HLog:
IPC Server handler 8 on 60020 took 9235ms appending an edit to hlog;
editcount=107
2009-11-06 09:10:20,081 WARN org.apache.hadoop.hbase.regionserver.HLog:
IPC Server handler 6 on 60020 took 9238ms appending an edit to hlog;
editcount=108
.
.
.
2009-11-06 09:10:25,953 WARN org.apache.hadoop.hbase.regionserver.HLog:
IPC Server handler 4 on 60020 took 15105ms appending an edit to hlog;
editcount=114
2009-11-06 09:10:44,071 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
state: Disconnected, type: None, path: null
2009-11-06 09:15:37,144 WARN org.apache.hadoop.hbase.util.Sleeper: We
slept 286007ms, ten times longer than scheduled: 10000
2009-11-06 09:15:37,146 INFO
org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
-1103392202834949346 lease expired
2009-11-06 09:15:37,146 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: OutOfMemoryError,
aborting.
java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.hbase.client.Put.readFields(Put.java:342)
        at
org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:421)
        at
org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:350)
        at
org.apache.hadoop.hbase.io.HbaseObjectWritable.readObject(HbaseObjectWritable.java:399)
        at
org.apache.hadoop.hbase.ipc.HBaseRPC$Invocation.readFields(HBaseRPC.java:174)
        at
org.apache.hadoop.hbase.ipc.HBaseServer$Connection.processData(HBaseServer.java:867)


Upon restarting the regionserver the following messages are written to
the log file over and over:

2009-11-07 00:01:00,299 WARN org.apache.hadoop.hbase.regionserver.Store:
Exception processing reconstruction log
hdfs://hfs-027010:8020/hbase/fc_test/338686730/oldlogfile.log opening data
org.apache.hadoop.fs.ChecksumException: Checksum error:
/blk_7784569113613463369:of:/hbase/fc_test/338686730/oldlogfile.log at 2818048
        at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
        at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
        at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
        at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
        at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1147)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1696)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1746)
        at java.io.DataInputStream.readFully(Unknown Source)

My hbase settings are:

export HBASE_HEAPSIZE=6000
export HBASE_OPTS="$HBASE_OPTS -XX:NewSize=6m -XX:MaxNewSize=6m
-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"

I have 16G of memory on each node with 6G allocated to the region server
and 2G allocated to the datanode.

Is there any way to salvage the contents of the log file, or is my only
option to move it out of the way and lose all the edits?

Would it be possible to have the regionserver do a clean shutdown before
it goes down the GC spiral?

stack wrote:
> Just to say I'm starting up test now....  will report back.  Thanks for your
> patience.
> St.Ack
>
>
>  


      