Subject: Re: Cassandra 1.1.6 crash without any exception or error in log
From: Narendra Sharma
To: user@cassandra.apache.org
Date: Thu, 2 Jan 2014 21:00:05 -0800

The root cause turned out to be a high heap setting. The Linux OOM Killer
(http://linux-mm.org/OOM_Killer) killed the process. It took some time to
figure out, but it was very interesting. We knew a high heap was a problem,
but we had no clue why the process disappeared while the actual heap usage
was well within the limit. syslog helped us figure this out.

About the Linux OOM Killer:
"It is the job of the linux 'oom killer' to *sacrifice* one or more
processes in order to free up memory for the system when all else fails"

On Thu, Jan 2, 2014 at 10:38 AM, Robert Coli wrote:

> On Thu, Jan 2, 2014 at 8:13 AM, Narendra Sharma wrote:
>
>> 8 node cluster running in aws. Any pointers where I should start looking?
>> No kill -9 in history.
>
> You should start looking at instructions as to how to upgrade to at least
> the top of the 1.1 line... :D
>
> =Rob

--
Narendra Sharma
Software Engineer
http://www.aeris.com
http://narendrasharma.blogspot.com/
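For anyone who lands on this thread with the same symptom (the process vanishes and the Cassandra log shows nothing), here is a minimal sketch of the syslog check described above. It is not the exact procedure used in this thread. It assumes the kernel logs OOM kills with lines containing "Kill process <pid> (<name>)" or "Killed process <pid> (<name>)", which is the usual shape but varies between kernel versions. The function name find_oom_kills and the sample lines are hypothetical and only there for illustration.

```python
import re
from typing import Iterable, List, NamedTuple


class OomKill(NamedTuple):
    pid: int
    command: str
    line: str


# Assumed message shapes (wording differs across kernel versions):
#   "Out of memory: Kill process 4242 (java) score 905 or sacrifice child"
#   "Killed process 4242 (java) total-vm:...kB, anon-rss:...kB, file-rss:...kB"
_OOM_RE = re.compile(r"Kill(?:ed)? process (\d+) \(([^)]+)\)")


def find_oom_kills(lines: Iterable[str], command: str = "java") -> List[OomKill]:
    """Return the log lines that report an OOM kill of the given command."""
    hits = []
    for line in lines:
        match = _OOM_RE.search(line)
        if match and match.group(2) == command:
            hits.append(OomKill(int(match.group(1)), match.group(2), line.rstrip("\n")))
    return hits


# Hypothetical sample lines, not taken from the cluster discussed in this thread.
SAMPLE_LINES = [
    "Jan  2 15:58:01 host kernel: Out of memory: Kill process 4242 (java) score 905 or sacrifice child\n",
    "Jan  2 15:58:01 host kernel: Killed process 4242 (java) total-vm:100kB, anon-rss:50kB, file-rss:0kB\n",
    "Jan  2 15:58:02 host sshd[99]: Accepted publickey for user\n",
    "Jan  2 15:59:00 host kernel: Killed process 77 (python) total-vm:1kB\n",
]

for hit in find_oom_kills(SAMPLE_LINES):
    print(hit.pid, hit.command, hit.line)
```

To run it against a real log, pass an open file instead of SAMPLE_LINES. The log usually lives at /var/log/syslog on Debian and Ubuntu and at /var/log/messages on Red Hat style systems, though the path depends on how syslog is configured on the host.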