From: "Dan Hendry"
Subject: RE: Errors During Compaction
Date: Tue, 25 Jan 2011 19:55:46 -0500

Limited joy I would say :) No long term damage at least.

I ended up deleting (moving to another disk) all the SSTables, which fixed the problem. I ran into even more problems during repair (detailed in another recent email) but it seems to have worked regardless. Just to be safe, I am in the process of starting a ‘manual repair’ (copying SSTables from other nodes for this particular CF, then restarting and running a cleanup + major compaction).

Any thoughts on what the root cause of this problem could be? It is somewhat worrying that a CF can randomly become corrupt, bringing down the whole node. Cassandra's handling of a corrupt CF (regardless of how rare an occurrence) is less than elegant.

Dan

From: Aaron Morton [mailto:aaron@thelastpickle.com]
Sent: January-25-11 16:03
To: user@cassandra.apache.org
Subject: Re: Errors During Compaction

Dan how did you go with this? More joy, less joy or a continuation of the current level of joy?

Aaron

On 24/01/2011, at 9:38 AM, Dan Hendry wrote:

I have run into a strange problem and was hoping for suggestions on how to fix it (0.7.0). When compaction occurs on one node for what appears to be one specific column family, the following error pops up in the Cassandra log. Compaction apparently fails and temp files don’t get cleaned up. After a while, and what seems to be multiple failed compactions on the CF, the node runs out of disk space and crashes. Not sure if it is a related problem or a function of this being a heavily used column family, but after failing to compact, compaction restarts on the same CF, exacerbating the issue.

Problems with this specific node started earlier this weekend when it crashed with an OOM error. This is quite surprising since my memtable thresholds and GC settings have been tuned to run with quite a bit of overhead during normal operation (max heap usage usually <= 10 GB on a 12 GB heap, average usage of 6-8 GB). I could not find anything abnormal in the logs which would prompt an OOM.

I will look things over tomorrow and try to provide a bit more information on the problem but, as a solution, I was going to wipe out all SSTables for this CF on this node and then run a repair. Far from ideal, but is this a reasonable solution?
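On the temp files that failed compactions leave behind: a rough way to see how much disk they are holding is to walk the data directory and total the files that look temporary. The sketch below is only an illustration. The function name `find_tmp_sstables` is made up for this note, and the `-tmp-` filename marker is an assumption about how temporary SSTable components are named, so it is exposed as a parameter and should be checked against the files actually on disk.

```python
import os


def find_tmp_sstables(data_dir, marker="-tmp-"):
    """Return (paths, total_bytes) for files under data_dir whose name contains marker.

    The default marker is an assumption about temporary SSTable naming,
    not something confirmed by the thread.
    """
    paths = []
    total_bytes = 0
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if marker in name:
                path = os.path.join(root, name)
                paths.append(path)
                total_bytes += os.path.getsize(path)
    return sorted(paths), total_bytes
```

Calling `find_tmp_sstables("/path/to/cassandra/data")` (the path is a placeholder) returns the matching files and their combined size in bytes. It only reports; whether a given file is safe to remove is a separate question, and presumably only worth considering with the node stopped.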

ERROR [CompactionExecutor:1] 2011-01-23 14:10:29,855 AbstractCassandraDaemon.java (line 91) Fatal exception in thread Thread[CompactionExecutor:1,1,RMI Runtime]
java.io.IOError: java.io.EOFException: attempted to skip -1983579368 bytes but only skipped 0
        at org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:78)
        at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:178)
        at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:143)
        at org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:135)
        at org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:38)
        at org.apache.commons.collections.iterators.CollatingIterator.set(CollatingIterator.java:284)
        at org.apache.commons.collections.iterators.CollatingIterator.least(CollatingIterator.java:326)
        at org.apache.commons.collections.iterators.CollatingIterator.next(CollatingIterator.java:230)
        at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:68)
        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:136)
        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:131)
        at org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
        at org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
        at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:323)
        at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:122)
        at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:92)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.EOFException: attempted to skip -1983579368 bytes but only skipped 0
        at org.apache.cassandra.io.sstable.IndexHelper.skipBloomFilter(IndexHelper.java:52)
        at org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:69)
        ... 20 more

Dan Hendry
(403) 660-2297

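On the negative byte count in the trace: a length can only come out negative if the bytes read for it have the high bit set when interpreted as a signed 32-bit integer, which is consistent with the reader being positioned on data that is not a valid length prefix. The sketch below illustrates that mechanism only. It assumes the length is read as a signed big-endian 32-bit value (the convention of Java's `DataInput.readInt`); the helper `skip_length_prefixed` and the example bytes are hypothetical, chosen so that they decode to the number in the log, and say nothing about what was actually on disk.

```python
import io
import struct


def skip_length_prefixed(stream):
    """Read a signed big-endian 32-bit length, then skip that many bytes.

    Returns the number of bytes skipped, or raises EOFError if the length
    is negative or the stream ends early.
    """
    raw = stream.read(4)
    if len(raw) < 4:
        raise EOFError("truncated length prefix")
    (length,) = struct.unpack(">i", raw)  # ">i" = big-endian signed int32
    if length < 0:
        # Nothing can be skipped for a negative length.
        raise EOFError(f"attempted to skip {length} bytes but only skipped 0")
    skipped = len(stream.read(length))
    if skipped != length:
        raise EOFError(f"attempted to skip {length} bytes but only skipped {skipped}")
    return skipped


# A well-formed record: length 3, three payload bytes, then more data.
good = io.BytesIO(struct.pack(">i", 3) + b"abc" + b"rest")
skip_length_prefixed(good)

# Hypothetical corrupt prefix: 0x89C4FB18 has the high bit set,
# so it decodes to a negative length and the skip fails.
bad = io.BytesIO(bytes.fromhex("89c4fb18") + b"\x00" * 8)
try:
    skip_length_prefixed(bad)
except EOFError as err:
    print(err)
```

A sanity check of this kind only detects the bad length; it does not explain how the file came to be in that state, which is the open question in the thread.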