Date: Tue, 7 Apr 2015 15:20:12 +0000 (UTC)
From: "Ariel Weisberg (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-9120) OutOfMemoryError when read auto-saved cache (probably broken)

    [ https://issues.apache.org/jira/browse/CASSANDRA-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14483344#comment-14483344 ]

Ariel Weisberg commented on CASSANDRA-9120:
-------------------------------------------

If anyone who has one of these problem files could make a copy available, that would be helpful, or even just the first few kilobytes. I would like to see exactly what kind of problem value we are dealing with; the odds of multiple people seeing corruption of the same length-prefix bytes seem a little low.

Trying to figure out how much memory is available and then deciding how many keys to load does not seem like a good heuristic to attempt.

For older versions where we don't have a checksum, I think the proposed patch makes sense, simply because there are no better options when there is no checksum. Operators always have the option of removing the problem key caches, so the incentive to put together anything that might be fragile (like partially loading caches) isn't strong, IMO.

I do see an issue where an operator has a changing memory layout (node size, heap size) or utilization and would want to be able to reduce the percentage of keys loaded. I am guessing the story for that today is that they delete the key cache?

I don't see how the number of keys saved matters. The number of keys saved is bounded by the number of keys loaded (or is it?), which means the cache fit into memory in the first place and should fit again on restart.

Before we go ahead and implement something to work around this, I would like to get a better sense of how we are arriving at this failure in the first place.
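To make the checksum idea concrete: a checksummed cache file could be read roughly as below. This is a minimal sketch, not Cassandra's actual AutoSavingCache format; the length-then-payload-then-CRC32 layout, the fileSize parameter, and the class and method names are assumptions for illustration only.

{code}
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

public final class ChecksummedCacheReader
{
    /**
     * Reads a length-prefixed payload followed by a CRC32 trailer and
     * rejects the whole file up front on mismatch, so corruption is caught
     * before any deserialization (and before any large allocation).
     * Hypothetical layout: [int length][payload bytes][long crc32].
     */
    public static byte[] readVerified(InputStream in, long fileSize) throws IOException
    {
        DataInputStream data = new DataInputStream(in);

        int length = data.readInt();
        // A negative or larger-than-file length prefix can only be corruption.
        if (length < 0 || length > fileSize)
            throw new IOException("Corrupt saved cache: bad payload length " + length);

        // Accumulate a CRC32 over the payload bytes as they are read.
        CRC32 crc = new CRC32();
        DataInputStream checked = new DataInputStream(new CheckedInputStream(data, crc));
        byte[] payload = new byte[length];
        checked.readFully(payload);

        // Read the trailer from the raw stream so it is not folded into the CRC.
        long expected = data.readLong();
        if (crc.getValue() != expected)
            throw new IOException("Corrupt saved cache: checksum mismatch");

        return payload;
    }
}
{code}

Any mismatch rejects the whole file before a single entry is deserialized, which is the "reject at the very beginning" behavior the reporter asks about below.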
> OutOfMemoryError when read auto-saved cache (probably broken)
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-9120
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9120
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Linux
>            Reporter: Vladimir
>            Assignee: Jeff Jirsa
>             Fix For: 3.0, 2.0.15, 2.1.5
>
>
> Found during tests on a 100-node cluster. After a restart I found that one node constantly crashes with an OutOfMemoryError. I guess that the auto-saved cache was corrupted and Cassandra can't recognize it.
>
> I see that similar issues were already fixed (when a negative size of some structure was read). Does the auto-saved cache have a checksum? It would help to reject a corrupted cache at the very beginning.
>
> As far as I can see, the current code still has that problem. The stack trace is:
> {code}
> INFO [main] 2015-03-28 01:04:13,503 AutoSavingCache.java (line 114) reading saved cache /storage/core/loginsight/cidata/cassandra/saved_caches/system-sstable_activity-KeyCache-b.db
> ERROR [main] 2015-03-28 01:04:14,718 CassandraDaemon.java (line 513) Exception encountered during startup
> java.lang.OutOfMemoryError: Java heap space
>         at java.util.ArrayList.<init>(Unknown Source)
>         at org.apache.cassandra.db.RowIndexEntry$Serializer.deserialize(RowIndexEntry.java:120)
>         at org.apache.cassandra.service.CacheService$KeyCacheSerializer.deserialize(CacheService.java:365)
>         at org.apache.cassandra.cache.AutoSavingCache.loadSaved(AutoSavingCache.java:119)
>         at org.apache.cassandra.db.ColumnFamilyStore.<init>(ColumnFamilyStore.java:262)
>         at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:421)
>         at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:392)
>         at org.apache.cassandra.db.Keyspace.initCf(Keyspace.java:315)
>         at org.apache.cassandra.db.Keyspace.<init>(Keyspace.java:272)
>         at org.apache.cassandra.db.Keyspace.open(Keyspace.java:114)
>         at org.apache.cassandra.db.Keyspace.open(Keyspace.java:92)
>         at org.apache.cassandra.db.SystemKeyspace.checkHealth(SystemKeyspace.java:536)
>         at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:261)
>         at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496)
>         at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585)
> {code}
> I looked at the source code of Cassandra and see:
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.cassandra/cassandra-all/2.0.10/org/apache/cassandra/db/RowIndexEntry.java
> {code}
> 119    int entries = in.readInt();
> 120    List<IndexHelper.IndexInfo> columnsIndex = new ArrayList<IndexHelper.IndexInfo>(entries);
> {code}
> It seems that the value of entries is invalid (negative or huge), so it tries to allocate an array with a huge initial capacity and hits the OOM. I deleted the saved_caches directory and was able to start the node correctly. We should expect that this may happen in the real world; Cassandra should be able to skip incorrect cached data and run.
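For the allocation site quoted above, a defensive bound on the length prefix might look roughly like this. A sketch only, not the committed fix; the readValidatedSize name and the minBytesPerEntry parameter are hypothetical.

{code}
import java.io.DataInput;
import java.io.IOException;

public final class SafeLengthPrefix
{
    /**
     * Reads a collection-size prefix and rejects values that cannot be valid,
     * so a corrupt int fails fast with a recognizable error instead of an
     * OutOfMemoryError (huge positive count) or IllegalArgumentException
     * (negative count) inside the ArrayList constructor.
     *
     * @param remainingBytes   bytes left in the file after the prefix
     * @param minBytesPerEntry smallest possible serialized entry (assume >= 1)
     */
    public static int readValidatedSize(DataInput in, long remainingBytes, int minBytesPerEntry)
            throws IOException
    {
        int entries = in.readInt();
        // A negative count is always corruption; a count larger than the
        // remaining file could possibly hold is corruption as well.
        if (entries < 0 || (long) entries * minBytesPerEntry > remainingBytes)
            throw new IOException("Corrupt saved cache: implausible entry count " + entries);
        return entries;
    }
}
{code}

With a guard like that, a corrupt prefix surfaces as an IOException that a caller such as AutoSavingCache.loadSaved() could catch and log, letting startup continue without the cache instead of dying with an OutOfMemoryError.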