From: "Ruben Aguiar (JIRA)"
To: dev@hbase.apache.org
Date: Tue, 24 Mar 2015 19:15:53 +0000 (UTC)
Subject: [jira] [Created] (HBASE-13329) Memstore flush fails if data has always the same value, breaking the region

Ruben Aguiar created HBASE-13329:
------------------------------------

             Summary: Memstore flush fails if data has always the same value, breaking the region
                 Key: HBASE-13329
                 URL: https://issues.apache.org/jira/browse/HBASE-13329
             Project: HBase
          Issue Type: Bug
          Components: regionserver
    Affects Versions: 1.0.1
         Environment: linux-debian-jessie ec2 - t2.micro instances
            Reporter: Ruben Aguiar

While trying to benchmark my OpenTSDB cluster, I created a script that always sends the same value (in this case 1) to HBase. After a few minutes, the whole region server crashes and the region itself becomes impossible to open again (it can be neither assigned nor unassigned).
After some investigation, the logs show that when a memstore flush is triggered on a large region (128 MB), the flush fails with an exception that kills the regionserver. On restart, replaying the edits produces the same error, leaving the region unavailable. I tried to manually unassign, assign, and close_region the region. None of that worked, because the code that reads and replays the edits crashes.

From my investigation this appears to be an overflow issue. The logs show that the function getMinimumMidpointArray tried to access index -32743 of an array, which is extremely close to the minimum short value in Java (-32768). Looking at the source code, a short seems to be used as the index; it is incremented for as long as the two vectors are the same, so it probably overflows on large vectors with equal data. Changing it to int should solve the problem.

Below are the logs from when the regionserver went down. Any help is appreciated. Please tell me if you need any other information:

2015-03-24 18:00:56,187 INFO  [regionserver//10.2.0.73:16020.logRoller] wal.FSHLog: Rolled WAL /hbase/WALs/10.2.0.73,16020,1427216382590/10.2.0.73%2C16020%2C1427216382590.default.1427220018516 with entries=143, filesize=134.70 MB; new WAL /hbase/WALs/10.2.0.73,16020,1427216382590/10.2.0.73%2C16020%2C1427216382590.default.1427220056140
2015-03-24 18:00:56,188 INFO  [regionserver//10.2.0.73:16020.logRoller] wal.FSHLog: Archiving hdfs://10.2.0.74:8020/hbase/WALs/10.2.0.73,16020,1427216382590/10.2.0.73%2C16020%2C1427216382590.default.1427219987709 to hdfs://10.2.0.74:8020/hbase/oldWALs/10.2.0.73%2C16020%2C1427216382590.default.1427219987709
2015-03-24 18:04:35,722 INFO  [MemStoreFlusher.0] regionserver.HRegion: Started memstore flush for tsdb,,1427133969325.52bc1994da0fea97563a4a656a58bec2., current region memstore size 128.04 MB
2015-03-24 18:04:36,154 FATAL [MemStoreFlusher.0] regionserver.HRegionServer: ABORTING region server 10.2.0.73,16020,1427216382590: Replay of WAL required. Forcing server shutdown
org.apache.hadoop.hbase.DroppedSnapshotException: region: tsdb,,1427133969325.52bc1994da0fea97563a4a656a58bec2.
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1999)
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1770)
    at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:1702)
    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:445)
    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:407)
    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$800(MemStoreFlusher.java:69)
    at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:225)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: -32743
    at org.apache.hadoop.hbase.CellComparator.getMinimumMidpointArray(CellComparator.java:478)
    at org.apache.hadoop.hbase.CellComparator.getMidpoint(CellComparator.java:448)
    at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.finishBlock(HFileWriterV2.java:165)
    at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.checkBlockBoundary(HFileWriterV2.java:146)
    at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:263)
    at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87)
    at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:932)
    at org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:121)
    at org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:71)
    at org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:879)
    at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2128)
    at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:1953)
    ... 7 more
2015-03-24 18:04:36,156 FATAL [MemStoreFlusher.0] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.coprocessor.MultiRowMutationEndpoint]
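
For reference, the suspected overflow can be reproduced in isolation. This is only a minimal sketch of the hypothesis above, not the actual HBase code: the class and method names are made up for illustration, and the loop is a guess at the pattern described (a short index incremented while two arrays stay equal). The real method may also add array offsets to the index, which could explain why the log shows -32743 rather than exactly -32768.

```java
import java.util.Arrays;

// Hypothetical stand-in for the comparison loop; not taken from CellComparator.
final class ShortIndexOverflowDemo {

    // Walks both arrays with a short index, as the report suspects the HBase code does.
    // Once the index passes Short.MAX_VALUE (32767) it wraps to -32768,
    // and the next array access throws ArrayIndexOutOfBoundsException.
    static int commonPrefixWithShortIndex(byte[] left, byte[] right) {
        int minLength = Math.min(left.length, right.length);
        short diffIdx = 0;
        while (diffIdx < minLength && left[diffIdx] == right[diffIdx]) {
            diffIdx++; // compound increment silently narrows back to short
        }
        return diffIdx;
    }

    // Same loop with an int index (the proposed fix); it cannot wrap at 32767.
    static int commonPrefixWithIntIndex(byte[] left, byte[] right) {
        int minLength = Math.min(left.length, right.length);
        int diffIdx = 0;
        while (diffIdx < minLength && left[diffIdx] == right[diffIdx]) {
            diffIdx++;
        }
        return diffIdx;
    }

    public static void main(String[] args) {
        // Two identical arrays longer than Short.MAX_VALUE, filled with the value 1.
        byte[] left = new byte[40_000];
        byte[] right = new byte[40_000];
        Arrays.fill(left, (byte) 1);
        Arrays.fill(right, (byte) 1);

        System.out.println("int index, common prefix: "
                + commonPrefixWithIntIndex(left, right));
        try {
            commonPrefixWithShortIndex(left, right);
            System.out.println("short index: no failure");
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("short index failed: " + e.getMessage());
        }
    }
}
```

With arrays shorter than 32768 bytes both variants behave the same, which would fit the failure only showing up once long runs of identical data are compared.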