Subject: Insufficient disk space to flush
From: Alexandru Dan Sicoe
To: user@cassandra.apache.org
Date: Thu, 1 Dec 2011 11:16:37 +0100

Hello everyone,

4-node Cassandra 0.8.5 cluster with RF=2. One node started throwing exceptions in its log:

ERROR 10:02:46,837 Fatal exception in thread Thread[FlushWriter:1317,5,main]
java.lang.RuntimeException: java.lang.RuntimeException: Insufficient disk space to flush 17296 bytes
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.RuntimeException: Insufficient disk space to flush 17296 bytes
        at org.apache.cassandra.db.ColumnFamilyStore.getFlushPath(ColumnFamilyStore.java:714)
        at org.apache.cassandra.db.ColumnFamilyStore.createFlushWriter(ColumnFamilyStore.java:2301)
        at org.apache.cassandra.db.Memtable.writeSortedContents(Memtable.java:246)
        at org.apache.cassandra.db.Memtable.access$400(Memtable.java:49)
        at org.apache.cassandra.db.Memtable$3.runMayThrow(Memtable.java:270)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
        ... 3 more

I checked the disk and, as expected, it's 100% full.

How do I recover from this without losing the data? I've got plenty of space on the other nodes, so I thought of doing a decommission, which I understand reassigns the node's ranges to the other nodes and replicates its data to them. After that's done, I plan to manually delete the data on the node and then rejoin it at the same cluster position with auto-bootstrap turned off, so that it won't get back the old data and can continue receiving new data.
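For the rejoin step, I imagine the relevant cassandra.yaml settings would be something like the fragment below (a sketch based on 0.8-era option names; the token value is a placeholder, not the node's real token):

```yaml
# cassandra.yaml fragment -- after `nodetool decommission` and wiping the
# node's data directories, rejoin at the same ring position without
# streaming the old data back:
auto_bootstrap: false
# Reuse the token the node owned before so it takes the same position.
# Placeholder value -- substitute the node's previous token.
initial_token: 85070591730234615865843651857942052864
```

Please correct me if the sequence or the settings are wrong.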
Note, I would like to keep 4 nodes in because the other three barely take the input load alone. These are just long-running tests until I get some better machines.

One strange thing I found is that the data folder on the node that filled up the disk is 150 GB (as measured with du), while the data folder on each of the other 3 nodes is 50 GB. At the same time, DataStax OpsCenter shows a size of around 50 GB for all 4 nodes. I thought the node was running a major compaction when it filled up the disk... but even that doesn't make sense, because shouldn't a major compaction at most be capable of doubling the size, not tripling it? Does anyone know how to explain this behavior?

Thanks,
Alex
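P.S. The "at most double" intuition can be put as one line of arithmetic: a major compaction rewrites all SSTables into one new file, so the old and new copies coexist only until the old ones are deleted. A back-of-envelope sketch (sizes taken from the figures above, not measured):

```python
# Rough disk-headroom check for a major compaction (a sketch; the 50 GB
# figure is the per-node size OpsCenter reports, not an exact measurement).
def peak_disk_during_major_compaction(live_gb: float) -> float:
    """A major compaction writes one new SSTable covering all live data
    while the old SSTables still exist, so peak usage is at most ~2x."""
    return 2 * live_gb

live = 50.0  # GB of live data per node, per OpsCenter
peak = peak_disk_during_major_compaction(live)
print(peak)  # 100.0 -- still well short of the 150 GB seen with du
```

So even a worst-case major compaction of 50 GB should peak near 100 GB, which is why the 150 GB on disk puzzles me.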