From: "Dan Hendry" <dan@ec2.dustbunnytycoon.com>
To: user@cassandra.apache.org
Subject: Out of Memory Issues - SERIOUS
Date: Thu, 7 Oct 2010 23:32:14 -0400

There seems to have been a fair amount of discussion on memory-related issues, so I apologize if this exact situation has come up before.

I am currently load testing a metrics platform I have written on top of Cassandra, and I have run into some very troubling issues. The application writes quite heavily: about 1000-2000 updates (columns) per second, using batch mutates of 20 columns each. This is divided between creating new rows and adding columns to a fairly limited number (<30) of existing index rows. Nearly all of these updates are read back within 10 seconds, and none contain any significant amount of data (generally much less than 100 bytes that I specify). Initially the test hums along nicely, but after some amount of time (1-2 hours) Cassandra crashes with an out-of-memory error. Unfortunately I have not had the opportunity to watch the test as it crashes, but it has happened in 2/2 tests.
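To make the write pattern concrete, here is a rough sketch of what each batch amounts to. It is written against the raw Thrift client that Pelops wraps, not my actual Pelops code, and the keyspace, column family, and key names are made-up placeholders:

    import java.nio.ByteBuffer;
    import java.util.*;
    import org.apache.cassandra.thrift.*;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TFramedTransport;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    public class WriteSketch {
        public static void main(String[] args) throws Exception {
            TTransport tr = new TFramedTransport(new TSocket("localhost", 9160));
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(tr));
            tr.open();
            client.set_keyspace("Metrics"); // placeholder keyspace

            // One batch mutate: 20 small columns, each well under 100 bytes
            List<Mutation> mutations = new ArrayList<Mutation>();
            long ts = System.currentTimeMillis() * 1000; // microsecond timestamps
            for (int i = 0; i < 20; i++) {
                Column c = new Column(ByteBuffer.wrap(("metric" + i).getBytes("UTF-8")),
                                      ByteBuffer.wrap("v".getBytes("UTF-8")), ts);
                ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
                cosc.setColumn(c);
                Mutation m = new Mutation();
                m.setColumn_or_supercolumn(cosc);
                mutations.add(m);
            }
            Map<ByteBuffer, Map<String, List<Mutation>>> batch =
                    new HashMap<ByteBuffer, Map<String, List<Mutation>>>();
            batch.put(ByteBuffer.wrap("some-row-key".getBytes("UTF-8")),
                      Collections.singletonMap("Events", mutations)); // placeholder CF
            client.batch_mutate(batch, ConsistencyLevel.ONE); // writes go out at ONE
            tr.close(); // the app then reads these columns back at ConsistencyLevel.ALL
        }
    }

The real application issues 50-100 of these batches per second, split between brand new rows and the small set of index rows.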
This is quite annoying, but the absolutely TERRIFYING behaviour is that when I restart Cassandra, it starts replaying the commit logs and then crashes with an out-of-memory error again. Restart a second time, crash with OOM; it seems to get through about 3/4 of the commit logs. Just to be absolutely explicit: at this point I am not trying to insert or read anything, just recover the previous updates. Unless somebody can suggest a way to recover the commit logs, I have effectively lost my data. The only way I have found to recover is to wipe the data directories. That does not matter right now, given that this is only a test, but this behaviour is completely unacceptable for a production system.

Here is information about the system which is probably relevant. Let me know if any additional details about my application would help sort out this issue:

- Cassandra 0.7 beta2
- DB machine: EC2 m1.large, with the commit log directory on an EBS volume and the data directory on ephemeral storage
- OS: Ubuntu Server 10.04
- With the exception of changing JMX settings, no memory or JVM options were changed in cassandra-env.sh
- In cassandra.yaml, I reduced binary_memtable_throughput_in_mb to 100 in my second test to try to follow the heap memory calculation formula (worked through below); I have 8 column families
- I am using the Sun JVM, specifically build 1.6.0_20-b02
- The app is written in Java using the latest Pelops library; I am sending updates at consistency level ONE and reading them at level ALL
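For reference, here is the heap arithmetic I was trying to follow. This is the sizing rule of thumb from the 0.7 configuration comments as I understand it, so treat it as approximate:

    heap needed ~= memtable_throughput_in_mb * 3 * (number of hot column families)
                   + 1 GB + internal caches
                ~= 100 MB * 3 * 8 + 1024 MB
                ~= 2400 MB + 1024 MB ~= 3.4 GB plus caches

An m1.large has 7.5 GB of RAM, so a heap in that range ought to fit comfortably on the box.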

I have been fairly impressed with Cassandra overall, and given that I am using a beta version, I don't expect fully polished behaviour. What is unacceptable, and quite frankly nearly unbelievable, is that Cassandra can't seem to recover from the error and I am losing data.

Dan Hendry