hbase-issues mailing list archives

From "Ruben Aguiar (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-13329) Memstore flush fails if data has always the same value, breaking the region
Date Wed, 25 Mar 2015 17:51:54 GMT

    [ https://issues.apache.org/jira/browse/HBASE-13329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380350#comment-14380350 ]

Ruben Aguiar commented on HBASE-13329:
--------------------------------------

It seems this is not an error due to the data itself, but is probably related to the metrics/tags
in openTSDB. After changing the script to send random values between 1 and 10, and even upping
the region server itself (I had run into some out-of-heap-space errors), the error still persists,
probably because I'm still writing to the same metric (a rough sketch of the write pattern follows
the log excerpt below). Note, however, that this time I've noticed some further behaviours. First
of all, while flushing the region, the flush continued to report as running without ever timing
out or finishing. While this was happening, a file in /hbase/data/default/tsdb/317da7fabf9ea9b15de80377bb792cd8/.tmp
kept increasing in size until I killed the region server. On a stop-hbase command, however,
the logs repeatedly showed this error:

2015-03-25 17:22:55,731 INFO  [regionserver//10.2.0.73:16020] regionserver.HRegionServer: Received CLOSE for the region: 317da7fabf9ea9b15de80377bb792cd8, which we are already trying to CLOSE, but not completed yet
2015-03-25 17:22:55,731 WARN  [regionserver//10.2.0.73:16020] regionserver.HRegionServer: Failed to close tsdb,,1427300108453.317da7fabf9ea9b15de80377bb792cd8. - ignoring and continuing
org.apache.hadoop.hbase.regionserver.RegionAlreadyInTransitionException: The region 317da7fabf9ea9b15de80377bb792cd8 was already closing. New CLOSE request is ignored.
        at org.apache.hadoop.hbase.regionserver.HRegionServer.closeRegion(HRegionServer.java:2635)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.closeRegionIgnoreErrors(HRegionServer.java:2561)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.closeUserRegions(HRegionServer.java:2219)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:869)
        at java.lang.Thread.run(Thread.java:745)
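
For context, the write pattern my script generates is roughly the following, expressed as a minimal
Java sketch. This is only an approximation: the real traffic goes through OpenTSDB rather than the
HBase client, and the row-key layout, metric name and qualifier used here are made up for
illustration, so treat it as a picture of the load shape (many rows sharing a long key prefix,
nearly identical values), not as a faithful reproducer.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

import java.util.Random;

public class SameMetricLoad {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table tsdb = conn.getTable(TableName.valueOf("tsdb"))) {
            byte[] family = Bytes.toBytes("t");                      // store name seen in the compaction log
            byte[] metricPrefix = Bytes.toBytes("benchmark.metric"); // hypothetical: every row shares this prefix
            Random rnd = new Random();
            long baseSecs = System.currentTimeMillis() / 1000L;
            for (long i = 0; i < 10_000_000L; i++) {
                // Adjacent row keys differ only in the trailing timestamp, so they share a long
                // common prefix, and the values are nearly identical (1..10), mimicking the
                // "same metric, same value" workload described above.
                Put put = new Put(Bytes.add(metricPrefix, Bytes.toBytes(baseSecs + i)));
                put.addColumn(family, Bytes.toBytes("v"), Bytes.toBytes(1 + rnd.nextInt(10)));
                tsdb.put(put);
            }
        }
    }
}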
 
I then issued a kill command on the region server and restarted the cluster. It behaved normally,
and even deleted the huge tmp file previously generated, until after a while (2-3 minutes)
this error appeared in the region server logs:

2015-03-25 17:31:04,752 ERROR [regionserver//10.2.0.73:16020-shortCompactions-1427304663929] regionserver.CompactSplitThread: Compaction failed Request = regionName=tsdb,,1427300108453.317da7fabf9ea9b15de80377bb792cd8., storeName=t, fileCount=6, fileSize=228.1 M (107.9 M, 11.9 M, 30.8 M, 25.0 M, 44.2 M, 8.5 M), priority=4, time=9704521299920
java.lang.ArrayIndexOutOfBoundsException: -12190
        at org.apache.hadoop.hbase.CellComparator.getMinimumMidpointArray(CellComparator.java:478)
        at org.apache.hadoop.hbase.CellComparator.getMidpoint(CellComparator.java:448)
        at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.finishBlock(HFileWriterV2.java:165)
        at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.checkBlockBoundary(HFileWriterV2.java:146)
        at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:263)
        at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87)
        at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:932)
        at org.apache.hadoop.hbase.regionserver.compactions.Compactor.performCompaction(Compactor.java:254)
        at org.apache.hadoop.hbase.regionserver.compactions.DefaultCompactor.compact(DefaultCompactor.java:102)
        at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.compact(DefaultStoreEngine.java:110)
        at org.apache.hadoop.hbase.regionserver.HStore.compact(HStore.java:1167)
        at org.apache.hadoop.hbase.regionserver.HRegion.compact(HRegion.java:1610)
        at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:511)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

This made the region server crash after a while (no writes were occurring at the time):

2015-03-25 17:35:42,379 WARN  [RS_CLOSE_REGION-10.2.0.73:16020-2] regionserver.DefaultMemStore: Snapshot called again without clearing previous. Doing nothing. Another ongoing flush or did we fail last attempt?
2015-03-25 17:35:42,379 WARN  [B.defaultRpcServer.handler=9,queue=0,port=16020] ipc.RpcServer: (responseTooSlow): {"processingtimems":23059,"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)","client":"10.2.0.74:38861","starttimems":1427304919320,"queuetimems":0,"class":"HRegionServer","responsesize":95,"method":"Scan"}
2015-03-25 17:35:42,701 FATAL [RS_CLOSE_REGION-10.2.0.73:16020-2] regionserver.HRegionServer: ABORTING region server 10.2.0.73,16020,1427304618843: Unrecoverable exception while closing region tsdb,,1427300108453.317da7fabf9ea9b15de80377bb792cd8., still finishing close
org.apache.hadoop.hbase.DroppedSnapshotException: region: tsdb,,1427300108453.317da7fabf9ea9b15de80377bb792cd8.
        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1999)
        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1770)
        at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1266)
        at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1198)
        at org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:138)
        at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException

I'm still very new to the whole hbase environment, but it seems to me the problem is the same
as the one I reported previously, only this time it happened on a compaction. The first time,
when the server appeared unresponsive, the same error probably happened while flushing, and it
kept retrying (probably generating and growing that tmp file). After the kill, the error happened
on a compaction instead. If I can help in any way to debug this problem, please let me know.
Any help is very much appreciated.
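
To make the overflow theory from my original report (quoted below) concrete, here is a minimal,
self-contained sketch. This is not HBase code; it just shows how a loop index declared as short
wraps to a negative value once it passes 32767 while scanning two long, identical byte arrays,
so the next array access fails with a negative index, analogous to the -32743 and -12190 indices
in the traces:

public class ShortIndexOverflow {
    public static void main(String[] args) {
        int len = 40_000;              // longer than Short.MAX_VALUE (32767)
        byte[] left = new byte[len];
        byte[] right = new byte[len];  // identical contents, so the scan never stops early

        short diffIdx = 0;             // hypothetical stand-in for the short index described in the report
        try {
            while (diffIdx < len && left[diffIdx] == right[diffIdx]) {
                diffIdx++;             // wraps from 32767 to -32768 instead of ever reaching len
            }
        } catch (ArrayIndexOutOfBoundsException e) {
            // Prints the negative index (exact message format varies by JDK).
            System.out.println("Access with wrapped index failed: " + e.getMessage());
        }
        // Declaring diffIdx as int (the fix suggested in the report) avoids the wrap entirely.
    }
}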


> Memstore flush fails if data has always the same value, breaking the region
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-13329
>                 URL: https://issues.apache.org/jira/browse/HBASE-13329
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 1.0.1
>         Environment: linux-debian-jessie
> ec2 - t2.micro instances
>            Reporter: Ruben Aguiar
>            Priority: Critical
>             Fix For: 2.0.0, 1.1.0
>
>
> While trying to benchmark my opentsdb cluster, I've created a script that always sends the
> same value to hbase (in this case 1). After a few minutes, the whole region server crashes
> and the region itself becomes impossible to open again (cannot assign or unassign). After
> some investigation, what I saw in the logs is that when a Memstore flush is called on a large
> region (128 MB), the process errors out, killing the regionserver. On restart, replaying the
> edits generates the same error, making the region unavailable. I tried to manually unassign,
> assign or close_region the region; that didn't work because the code that reads/replays the
> edits crashes.
> From my investigation this seems to be an overflow issue. The logs show that the function
> getMinimumMidpointArray tried to access index -32743 of an array, extremely close to the
> minimum short value in Java. Upon investigation of the source code, it seems a short index
> is used, incremented for as long as the two vectors are equal, probably making it overflow
> on large vectors with identical data. Changing it to int should solve the problem.
> Here follow the Hadoop logs from when the region server went down. Any help is appreciated.
> If you need any other information, please do tell me:
> 2015-03-24 18:00:56,187 INFO  [regionserver//10.2.0.73:16020.logRoller] wal.FSHLog: Rolled WAL /hbase/WALs/10.2.0.73,16020,1427216382590/10.2.0.73%2C16020%2C1427216382590.default.1427220018516 with entries=143, filesize=134.70 MB; new WAL /hbase/WALs/10.2.0.73,16020,1427216382590/10.2.0.73%2C16020%2C1427216382590.default.1427220056140
> 2015-03-24 18:00:56,188 INFO  [regionserver//10.2.0.73:16020.logRoller] wal.FSHLog: Archiving hdfs://10.2.0.74:8020/hbase/WALs/10.2.0.73,16020,1427216382590/10.2.0.73%2C16020%2C1427216382590.default.1427219987709 to hdfs://10.2.0.74:8020/hbase/oldWALs/10.2.0.73%2C16020%2C1427216382590.default.1427219987709
> 2015-03-24 18:04:35,722 INFO  [MemStoreFlusher.0] regionserver.HRegion: Started memstore flush for tsdb,,1427133969325.52bc1994da0fea97563a4a656a58bec2., current region memstore size 128.04 MB
> 2015-03-24 18:04:36,154 FATAL [MemStoreFlusher.0] regionserver.HRegionServer: ABORTING region server 10.2.0.73,16020,1427216382590: Replay of WAL required. Forcing server shutdown
> org.apache.hadoop.hbase.DroppedSnapshotException: region: tsdb,,1427133969325.52bc1994da0fea97563a4a656a58bec2.
> 	at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1999)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1770)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1702)
> 	at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:445)
> 	at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:407)
> 	at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$800(MemStoreFlusher.java:69)
> 	at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:225)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: -32743
> 	at org.apache.hadoop.hbase.CellComparator.getMinimumMidpointArray(CellComparator.java:478)
> 	at org.apache.hadoop.hbase.CellComparator.getMidpoint(CellComparator.java:448)
> 	at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.finishBlock(HFileWriterV2.java:165)
> 	at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.checkBlockBoundary(HFileWriterV2.java:146)
> 	at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:263)
> 	at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87)
> 	at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:932)
> 	at org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:121)
> 	at org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:71)
> 	at org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:879)
> 	at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2128)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1953)
> 	... 7 more
> 2015-03-24 18:04:36,156 FATAL [MemStoreFlusher.0] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
