From: Piavlo
To: user@cassandra.apache.org
Date: Thu, 14 Jun 2012 13:00:21 +0300
Subject: Re: Urgent - IllegalArgumentException during compaction and memtable flush

Hi Sylvain,

Yes, this UserCompletions CF uses a composite comparator, and I do use sstable compression.

What's the procedure to check whether a compressed sstable is corrupted? If it is corrupted, what can I do to fix the issue with minimal impact on cluster load? Is there a way to delete all UserCompletions sstables on the problematic node and then run repair on this CF only? For example: disable thrift, drain the memtables so the node does not read the commit log on startup, then delete the sstables and start the node again. Will that work?

BUT I saw this error in ValidationExecutor on 3 nodes (and RF=3 too) at almost the same time. It happened on 3 separate occasions, probably because of 3 attempts at rerunning "repair -pr UserCompletions dsc2b.internal", which never returned from the blocked nodetool command; each time a repair finished, the new sstables triggered compactions on all involved nodes. Could this mean that the sstable is not corrupted, but that some BAD column name was inserted OK and then cannot be read later by ValidationExecutor on any of the replica nodes?
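To compare when the error fired on each replica, the ERROR lines can be pulled out of every node's log and lined up by timestamp. The following is only a minimal sketch: the function name `find_errors` and the regular expression are illustrative assumptions based on the log line format quoted in this message, not anything shipped with Cassandra.

```python
import re

# Assumed line prefix, modelled on entries such as:
# "ERROR [ValidationExecutor:129] 2012-06-13 18:55:32,236 AbstractCassandraDaemon.java (line 139) ..."
LINE_RE = re.compile(
    r"^\s*(?P<level>[A-Z]+)\s+\[(?P<thread>[^\]]+)\]\s+"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+(?P<message>.*)$"
)


def find_errors(log_text, thread_prefix="ValidationExecutor"):
    """Return (timestamp, thread) pairs for ERROR lines logged by matching threads."""
    hits = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            # Stack trace lines and other continuation lines do not match the prefix.
            continue
        if match.group("level") == "ERROR" and match.group("thread").startswith(thread_prefix):
            hits.append((match.group("timestamp"), match.group("thread")))
    return hits
```

Running it over the log of each node and comparing the returned timestamps shows whether the failures line up with the same repair session.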
Check the relevant cassandra logs below

dsc2b.internal/10.234.71.33
-----------------------
INFO [AntiEntropySessions:66] 2012-06-13 18:49:24,464 AntiEntropyService.java (line 658) [repair #7ec142c0-b588-11e1-0000-f423231d3fff] new session: will sync dsc2b.internal/10.234.71.33, /10.49.127.4, /10.58.249.118 on range (85070591730234615865843651857942052864,113427455640312821154458202477256070485] for PRODUCTION.[UserCompletions]
INFO [AntiEntropySessions:66] 2012-06-13 18:49:24,465 AntiEntropyService.java (line 837) [repair #7ec142c0-b588-11e1-0000-f423231d3fff] requests for merkle tree sent for UserCompletions (to [/10.49.127.4, /10.58.249.118, dsc2b.internal/10.234.71.33])
INFO [ValidationExecutor:129] 2012-06-13 18:49:24,466 ColumnFamilyStore.java (line 705) Enqueuing flush of Memtable-UserCompletions@843906517(9952311/21343163 serialized/live bytes, 41801 ops)
INFO [FlushWriter:2563] 2012-06-13 18:49:24,467 Memtable.java (line 246) Writing Memtable-UserCompletions@843906517(9952311/21343163 serialized/live bytes, 41801 ops)
INFO [FlushWriter:2563] 2012-06-13 18:49:24,828 Memtable.java (line 283) Completed flushing /var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-515-Data.db (1671566 bytes)
ERROR [ValidationExecutor:129] 2012-06-13 18:55:32,236 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[ValidationExecutor:129,1,main]
java.lang.IllegalArgumentException
        at java.nio.Buffer.limit(Buffer.java:249)
....
-----------------------

dsc1a.internal/10.49.127.4
-----------------------
INFO [ValidationExecutor:125] 2012-06-13 18:49:24,457 ColumnFamilyStore.java (line 705) Enqueuing flush of Memtable-UserCompletions@266077104(9047552/76151840 serialized/live bytes, 38000 ops)
INFO [FlushWriter:2670] 2012-06-13 18:49:24,466 Memtable.java (line 246) Writing Memtable-UserCompletions@266077104(9047552/76151840 serialized/live bytes, 38000 ops)
INFO [FlushWriter:2670] 2012-06-13 18:49:24,969 Memtable.java (line 283) Completed flushing /var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-1030-Data.db (1508368 bytes)
INFO [CompactionExecutor:3299] 2012-06-13 18:49:24,971 CompactionTask.java (line 115) Compacting [SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-1027-Data.db'), SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-1030-Data.db'), SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-1028-Data.db'), SSTableReader(path='/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-1029-Data.db')]
INFO [CompactionExecutor:3299] 2012-06-13 18:50:03,554 CompactionTask.java (line 223) Compacted to [/var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-1031-Data.db,]. 23,417,251 to 23,832,802 (~101% of original) bytes for 116,956 keys at 0.589102MB/s. Time: 38,582ms.
ERROR [ValidationExecutor:125] 2012-06-13 18:56:58,961 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[ValidationExecutor:125,1,main]
java.lang.IllegalArgumentException
        at java.nio.Buffer.limit(Buffer.java:249)
...
-------------------------

dsc2c.internal/10.58.249.118
-------------------------
INFO [ValidationExecutor:119] 2012-06-13 18:49:24,305 ColumnFamilyStore.java (line 705) Enqueuing flush of Memtable-UserCompletions@1279460811(19014066/66201229 serialized/live bytes, 79838 ops)
INFO [FlushWriter:2001] 2012-06-13 18:49:24,326 Memtable.java (line 246) Writing Memtable-UserCompletions@1279460811(19014066/66201229 serialized/live bytes, 79838 ops)
INFO [FlushWriter:2001] 2012-06-13 18:49:24,848 Memtable.java (line 283) Completed flushing /var/lib/cassandra/data/PRODUCTION/UserCompletions-hc-548-Data.db (3177074 bytes)
ERROR [ValidationExecutor:119] 2012-06-13 18:55:50,387 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[ValidationExecutor:119,1,main]
java.lang.IllegalArgumentException
        at java.nio.Buffer.limit(Buffer.java:249)
...
-------------------------

Thanks for your help.

On 06/14/2012 11:09 AM, Sylvain Lebresne wrote:
> On Thu, Jun 14, 2012 at 8:26 AM, Piavlo wrote:
>> I started looking for similar messages on other nodes saw a SINGLE IllegalArgumentException on
>> ValidationExecutor on the same node and 2 other nodes (this is a 6 node cluster) which happened
>> at almost the same time , in all nodes while flushing same UserCompletions CF memtable. This
>> happened about 12hours before the IllegalArgumentException in CompactionExecutor.
> This actually does not happen during a flush but during a validation
> compaction, which happens during a repair.
> The exception is basically saying there is invalid composite column
> name (you do use a composite comparator right?).
> I guess that could result from some on-disk corruption. Are you using
> sstable compression on UserCompletions?
> (I am asking because compressed sstables have checksums)
>
>> And even bigger problem now is that running repairs on other CFs against
>> different nodes does not have any effect, for example running
>> /usr/bin/nodetool -h dsc2b.internal -pr repair PRODUCTION UserDirectVendors
>> does not trigger any repair activity and nothing in the logs to indicate a
>> start of repair. And I have ~24hours left to repair some CFs before the gc
>> period ends :(
> Does that happen on every node?
> What can happen is that some failed repair may block others from
> starting. One thing you can try is to run the method called
> forceTerminateAllRepairSessions in JMX under
> org.apache.cassandra.db->StorageService->Operations (I'm afraid there
> is no nodetool hook so you will have to use jconsole). After that, try
> starting a repair again. If that doesn't work, it's worth trying to
> restart the node.
>
> --
> Sylvain
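
Sylvain's remark that the exception points at an invalid composite column name can be illustrated with a small sketch. It assumes the usual composite layout, in which each component is stored as a 2-byte big-endian length, the value bytes, and one end-of-component byte. Under that assumption, a damaged length prefix asks the reader to move past the end of the buffer, which is the kind of request java.nio.Buffer.limit rejects with IllegalArgumentException. The function name `split_composite` is hypothetical, and this is not Cassandra's own code.

```python
import struct


def split_composite(name: bytes) -> list:
    """Split a composite column name into its component values.

    Assumed layout per component: 2-byte big-endian length, value bytes,
    then 1 end-of-component byte.
    """
    components = []
    pos = 0
    while pos < len(name):
        if pos + 2 > len(name):
            raise ValueError("truncated length prefix at offset %d" % pos)
        (length,) = struct.unpack_from(">H", name, pos)
        pos += 2
        end = pos + length
        # A bad length prefix points past the end of the name; a Java reader
        # would fail at this step when it tries to set the buffer limit.
        if end + 1 > len(name):
            raise ValueError(
                "component of length %d at offset %d overruns the name" % (length, pos)
            )
        components.append(name[pos:end])
        pos = end + 1  # skip the end-of-component byte
    return components
```

A well-formed two-component name splits cleanly, while a name whose first length prefix claims 16 bytes but carries only 3 raises ValueError. Either on-disk corruption or a badly encoded name written by a client would produce such a name.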