Date: Tue, 14 Apr 2015 20:51:59 +0000 (UTC)
From: "Ariel Weisberg (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Commented] (CASSANDRA-9120) OutOfMemoryError when read auto-saved cache (probably broken)

    [ https://issues.apache.org/jira/browse/CASSANDRA-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14494852#comment-14494852 ]

Ariel Weisberg commented on CASSANDRA-9120:
-------------------------------------------

Maybe we should improve the error message on OOM? If there is an OOM while loading the cache, log a message that guides the operator towards what to do?

bq. In this case it could be good to add some information about startup options (same file or separate).

What do you mean by add information? I am just looking for something that doesn't have a potential downside.
Creating that file is a one-way ticket to skipping the caches, no matter how many restarts it takes to get the system up. If some other bug or configuration issue is preventing the database from coming up, that could be a problem.

> OutOfMemoryError when read auto-saved cache (probably broken)
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-9120
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9120
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Linux
>            Reporter: Vladimir
>             Fix For: 3.0, 2.0.15, 2.1.5
>
>
> Found during tests on a 100-node cluster. After a restart I found that one node constantly crashes with an OutOfMemory exception. I guess that the auto-saved cache was corrupted and Cassandra can't recognize it. I see that similar issues were already fixed (when a negative size of some structure was read). Does the auto-saved cache have a checksum? It would help to reject a corrupted cache at the very beginning.
> As far as I can see, the current code still has that problem.
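On the checksum question raised in the description: one common way to reject a corrupted file up front is to store a checksum next to the payload and verify it before deserializing anything. The following is a minimal sketch of that idea using `java.util.zip.CRC32`, not a description of Cassandra's saved-cache format. The class name `ChecksummedCacheExample`, the methods `seal` and `open`, and the 8-byte trailer layout are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Optional;
import java.util.zip.CRC32;

// Hypothetical helper, for illustration only; not part of Cassandra.
final class ChecksummedCacheExample {

    /** Returns payload followed by an 8-byte trailer holding its CRC32. */
    static byte[] seal(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ByteBuffer.allocate(payload.length + Long.BYTES)
                .put(payload)
                .putLong(crc.getValue())
                .array();
    }

    /**
     * Returns the payload if the trailer matches its CRC32. Returns empty
     * otherwise, so the caller can start with a cold cache instead of failing.
     */
    static Optional<byte[]> open(byte[] sealed) {
        if (sealed.length < Long.BYTES) {
            return Optional.empty(); // too short to even hold the trailer
        }
        int payloadLength = sealed.length - Long.BYTES;
        CRC32 crc = new CRC32();
        crc.update(sealed, 0, payloadLength);
        long stored = ByteBuffer.wrap(sealed, payloadLength, Long.BYTES).getLong();
        if (stored != crc.getValue()) {
            return Optional.empty(); // checksum mismatch: treat as corrupted
        }
        return Optional.of(Arrays.copyOf(sealed, payloadLength));
    }

    public static void main(String[] args) {
        byte[] sealed = seal(new byte[] {1, 2, 3});
        System.out.println("intact file accepted: " + open(sealed).isPresent());
        sealed[1] ^= 0x40; // simulate a flipped bit on disk
        System.out.println("damaged file accepted: " + open(sealed).isPresent());
    }
}
```

Under this sketch, a caller that gets an empty result would log a warning and skip the saved cache rather than abort startup.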
> Stack trace is:
> {code}
>  INFO [main] 2015-03-28 01:04:13,503 AutoSavingCache.java (line 114) reading saved cache /storage/core/loginsight/cidata/cassandra/saved_caches/system-sstable_activity-KeyCache-b.db
> ERROR [main] 2015-03-28 01:04:14,718 CassandraDaemon.java (line 513) Exception encountered during startup
> java.lang.OutOfMemoryError: Java heap space
>         at java.util.ArrayList.<init>(Unknown Source)
>         at org.apache.cassandra.db.RowIndexEntry$Serializer.deserialize(RowIndexEntry.java:120)
>         at org.apache.cassandra.service.CacheService$KeyCacheSerializer.deserialize(CacheService.java:365)
>         at org.apache.cassandra.cache.AutoSavingCache.loadSaved(AutoSavingCache.java:119)
>         at org.apache.cassandra.db.ColumnFamilyStore.<init>(ColumnFamilyStore.java:262)
>         at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:421)
>         at org.apache.cassandra.db.ColumnFamilyStore.createColumnFamilyStore(ColumnFamilyStore.java:392)
>         at org.apache.cassandra.db.Keyspace.initCf(Keyspace.java:315)
>         at org.apache.cassandra.db.Keyspace.<init>(Keyspace.java:272)
>         at org.apache.cassandra.db.Keyspace.open(Keyspace.java:114)
>         at org.apache.cassandra.db.Keyspace.open(Keyspace.java:92)
>         at org.apache.cassandra.db.SystemKeyspace.checkHealth(SystemKeyspace.java:536)
>         at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:261)
>         at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496)
>         at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585)
> {code}
> I looked at the Cassandra source code and see:
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.cassandra/cassandra-all/2.0.10/org/apache/cassandra/db/RowIndexEntry.java
> 119 int entries = in.readInt();
> 120 List columnsIndex = new ArrayList(entries);
> It seems that the value of entries is invalid (negative), so it tries to allocate an array with a huge initial capacity and hits OOM.
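The pattern at lines 119-120 (a count read from an unverified file and passed directly to `new ArrayList(entries)`) can be guarded by validating the count before allocating. This is a minimal sketch under stated assumptions, not the fix that was committed for this issue. The class `BoundedCountExample`, the limit `MAX_ENTRIES`, and the "count followed by longs" layout are hypothetical. Note that `ArrayList` rejects a negative capacity with `IllegalArgumentException`, so an OOM at this line is more consistent with a very large positive value. The check below covers both cases.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, for illustration only; not part of Cassandra.
final class BoundedCountExample {

    // Assumed sanity limit; a real value would depend on the file format.
    static final int MAX_ENTRIES = 1 << 16;

    /** Reads a 4-byte count and rejects values that cannot be valid. */
    static int readBoundedCount(DataInputStream in, int maxEntries) throws IOException {
        int entries = in.readInt();
        if (entries < 0 || entries > maxEntries) {
            throw new IOException("Implausible entry count " + entries
                    + "; the saved cache is probably corrupted");
        }
        return entries;
    }

    /** Reads a count-prefixed list of longs, allocating only after the count is checked. */
    static List<Long> readEntries(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int entries = readBoundedCount(in, MAX_ENTRIES);
        List<Long> result = new ArrayList<>(entries);
        for (int i = 0; i < entries; i++) {
            result.add(in.readLong()); // throws EOFException if the data is truncated
        }
        return result;
    }

    /** Returns empty instead of throwing, so a caller can skip the cache and keep starting up. */
    static Optional<List<Long>> tryReadEntries(byte[] data) {
        try {
            return Optional.of(readEntries(data));
        } catch (IOException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        byte[] good = {0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 42};
        byte[] corrupt = {0x7f, (byte) 0xff, (byte) 0xff, (byte) 0xff};
        System.out.println("good data loaded: " + tryReadEntries(good).isPresent());
        System.out.println("corrupt data loaded: " + tryReadEntries(corrupt).isPresent());
    }
}
```

Failing with an `IOException` rather than an `OutOfMemoryError` keeps the problem recoverable: the loader can log which file was rejected and continue with an empty cache.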
> I have deleted the saved_caches directory and was able to start the node correctly. We should expect that this may happen in the real world. Cassandra should be able to skip incorrect cached data and run.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)