From cassandra-user-return-1352-apmail-incubator-cassandra-user-archive=incubator.apache.org@incubator.apache.org Mon Nov 16 18:02:47 2009 Return-Path: Delivered-To: apmail-incubator-cassandra-user-archive@minotaur.apache.org Received: (qmail 17003 invoked from network); 16 Nov 2009 18:02:47 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 16 Nov 2009 18:02:47 -0000 Received: (qmail 91002 invoked by uid 500); 16 Nov 2009 18:02:47 -0000 Delivered-To: apmail-incubator-cassandra-user-archive@incubator.apache.org Received: (qmail 90977 invoked by uid 500); 16 Nov 2009 18:02:47 -0000 Mailing-List: contact cassandra-user-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: cassandra-user@incubator.apache.org Delivered-To: mailing list cassandra-user@incubator.apache.org Received: (qmail 90968 invoked by uid 99); 16 Nov 2009 18:02:47 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 16 Nov 2009 18:02:47 +0000 X-ASF-Spam-Status: No, hits=-1.8 required=10.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_MED,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: local policy) Received: from [15.201.24.18] (HELO g4t0015.houston.hp.com) (15.201.24.18) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 16 Nov 2009 18:02:36 +0000 Received: from G3W0631.americas.hpqcorp.net (g3w0631.americas.hpqcorp.net [16.233.59.15]) (using TLSv1 with cipher RC4-MD5 (128/128 bits)) (No client certificate requested) by g4t0015.houston.hp.com (Postfix) with ESMTPS id 783CF8454 for ; Mon, 16 Nov 2009 18:02:14 +0000 (UTC) Received: from G3W0055.americas.hpqcorp.net (16.232.1.152) by G3W0631.americas.hpqcorp.net (16.233.59.15) with Microsoft SMTP Server (TLS) id 8.2.176.0; Mon, 16 Nov 2009 18:01:26 +0000 Received: from GVW0432EXB.americas.hpqcorp.net ([16.234.32.145]) by G3W0055.americas.hpqcorp.net 
([16.232.1.152]) with mapi; Mon, 16 Nov 2009 18:01:26 +0000
From: "Freeman, Tim"
To: "cassandra-user@incubator.apache.org"
Date: Mon, 16 Nov 2009 17:59:39 +0000
Subject: RE: Timeout Exception
Thread-Topic: Timeout Exception
Thread-Index: Acpm5NczBII+59hyTPWKfdJLCpIFzgAABTtA
Message-ID: <59DD1BA8FD3C0F4C90771C18F2B5B53A4C842D4E4A@GVW0432EXB.americas.hpqcorp.net>
References: <35bb42690911092025l109b871exa58ff629d624e299@mail.gmail.com>
 <35bb42690911101123y795c80erb18c2091fe960ae2@mail.gmail.com>
 <35bb42690911101149i18fcc590v1cbc2ba9b2b99356@mail.gmail.com>
 <35bb42690911101153y3a998431se86a64613f31b030@mail.gmail.com>
 <35bb42690911160946pb37f763x52666a890ded9a91@mail.gmail.com>
In-Reply-To: <35bb42690911160946pb37f763x52666a890ded9a91@mail.gmail.com>
Accept-Language: en-US
Content-Language: en-US
X-MS-Has-Attach:
X-MS-TNEF-Correlator:
acceptlanguage: en-US
Content-Type: multipart/alternative; boundary="_000_59DD1BA8FD3C0F4C90771C18F2B5B53A4C842D4E4AGVW0432EXBame_"
MIME-Version: 1.0
X-Virus-Checked: Checked by ClamAV on apache.org

--_000_59DD1BA8FD3C0F4C90771C18F2B5B53A4C842D4E4AGVW0432EXBame_
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

I'm running 0.4.1. I used to get timeouts, then I changed my timeout from 5 seconds to 30 seconds and I get no more timeouts. The relevant line from storage-conf.xml is:

  <RpcTimeoutInMillis>30000</RpcTimeoutInMillis>

The maximum latency is often just over 5 seconds in the worst case when I fetch thousands of records, so the default timeout of 5 seconds happens to be a little bit too low for me. My records are ~100Kbytes each. You may get different results if your records are much larger or much smaller.

The other issue I was having a few days ago was that the machine was page faulting so garbage collections were taking forever. Some GCs took 20 minutes in another Java process. I didn't have verbose:gc turned on in Cassandra so I'm not sure what the score was there, but there's little reason to expect it to be qualitatively better, since it's pretty random which process gets some of its pages swapped out. On a Linux machine, run "vmstat 5" when your machine is loaded and if you see numbers greater than 0 in the "si" and "so" columns in rows after the first, tell one of your Java processes to take less memory.

Tim Freeman
Email: tim.freeman@hp.com
Desk in Palo Alto: (650) 857-2581
Home: (408) 774-1298
Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)

From: Chris Were [mailto:chris.were@gmail.com]
Sent: Monday, November 16, 2009 9:47 AM
To: Jonathan Ellis
Cc: cassandra-user@incubator.apache.org
Subject: Re: Timeout Exception

I turned on debug logging for a few days and timeouts happened across pretty much all requests. I couldn't see any particular request that was consistently the problem.

After some experimenting it seems that shutting down cassandra and restarting resolves the problem. Once it hits the JVM memory limit, however, the timeouts start again. I have read the page on MemTable thresholds and have tried thresholds of 32MB, 64MB and 128MB with no noticeable difference. Cassandra is set to use 7GB of memory. I have 12 CFs, but only 6 of those have lots of data.

Cheers,
Chris

On Tue, Nov 10, 2009 at 11:55 AM, Jonathan Ellis <jbellis@gmail.com> wrote:
if you're timing out doing a slice on 10 columns w/ 10% cpu used,
something is broken

is it consistent as to which keys this happens on?  try turning on
debug logging and seeing where the latency is coming from.

On Tue, Nov 10, 2009 at 1:53 PM, Chris Were <chris.were@gmail.com> wrote:
>
> On Tue, Nov 10, 2009 at 11:50 AM, Jonathan Ellis <jbellis@gmail.com> wrote:
>>
>> On Tue, Nov 10, 2009 at 1:49 PM, Chris Were <chris.were@gmail.com> wrote:
>> > Maybe... but it's not just multigets, it also happens when retrieving
>> > one row with get_slice.
>>
>> how many of the 3M columns are you trying to slice at once?
>
> Sorry, I must have mixed up the terminology.
> There's ~3M keys, but less than 10 columns in each. The get_slice calls are
> to retrieve all the columns (10) for a given key.

--_000_59DD1BA8FD3C0F4C90771C18F2B5B53A4C842D4E4AGVW0432EXBame_
Content-Type: text/html; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

I'm running 0.4.1.  I used to get timeouts, then I changed my timeout from 5 seconds to 30 seconds and I get no more timeouts.  The relevant line from storage-conf.xml is:

 

  <RpcTimeoutInMillis>30000</RpcTimeoutInMillis>

 

The maximum latency is often just over 5 seconds in the worst case when I fetch thousands of records, so the default timeout of 5 seconds happens to be a little bit too low for me.  My records are ~100Kbytes each.  You may get different results if your records are much larger or much smaller.
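
For anyone who wants to check their own numbers against the configured timeout, here is a minimal sketch, not anything Cassandra ships. It assumes the setting sits somewhere inside an XML document (the <Storage> wrapper in the usage example is an assumption), and the function names and latency figures are made up for illustration.

```python
import xml.etree.ElementTree as ET


def rpc_timeout_ms(conf_xml):
    """Return the RpcTimeoutInMillis value from config XML text, or None if absent."""
    root = ET.fromstring(conf_xml)
    # iter() also visits the root itself, so the element is found at any depth.
    for elem in root.iter("RpcTimeoutInMillis"):
        return int(elem.text.strip())
    return None


def over_timeout(latencies_ms, timeout_ms):
    """Return the measured latencies (hypothetical client-side numbers) above the timeout."""
    return [ms for ms in latencies_ms if ms > timeout_ms]


if __name__ == "__main__":
    # Assumed wrapper element; only the inner line comes from the message above.
    conf = "<Storage><RpcTimeoutInMillis>30000</RpcTimeoutInMillis></Storage>"
    measured = [1200, 4800, 5300]  # illustrative values, in milliseconds
    print(over_timeout(measured, 5000))                  # requests a 5 s timeout would cut off
    print(over_timeout(measured, rpc_timeout_ms(conf)))  # requests the configured timeout would cut off
```

With worst-case latencies just over 5 seconds, the first call reports the slow requests and the second reports none, which matches what I saw after raising the timeout.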

 

The other issue I was having a few days ago was that the machine was page faulting so garbage collections were taking forever.  Some GCs took 20 minutes in another Java process.  I didn't have verbose:gc turned on in Cassandra so I'm not sure what the score was there, but there's little reason to expect it to be qualitatively better, since it's pretty random which process gets some of its pages swapped out.  On a Linux machine, run "vmstat 5" when your machine is loaded and if you see numbers greater than 0 in the "si" and "so" columns in rows after the first, tell one of your Java processes to take less memory.
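
The vmstat check can be automated roughly as follows. This is a sketch under assumptions: it expects the usual Linux vmstat layout with a column-name row containing "si" and "so", the function name is made up, and it skips the first sample row because that row reports averages since boot.

```python
def swapping_rows(vmstat_output):
    """Return (si, so) pairs for sample rows after the first that show swap activity."""
    rows = [line.split() for line in vmstat_output.strip().splitlines()]
    # Locate the column-name row, i.e. the one that lists both "si" and "so".
    header_idx = next(i for i, cols in enumerate(rows) if "si" in cols and "so" in cols)
    header = rows[header_idx]
    si_col, so_col = header.index("si"), header.index("so")

    hits = []
    samples = rows[header_idx + 1:]
    for cols in samples[1:]:  # skip the first sample (averages since boot)
        # vmstat reprints its headers periodically; ignore anything that is not a data row.
        if len(cols) != len(header) or not cols[si_col].isdigit():
            continue
        si, so = int(cols[si_col]), int(cols[so_col])
        if si > 0 or so > 0:
            hits.append((si, so))
    return hits
```

Feed it the captured text of "vmstat 5" from a loaded machine; a non-empty result means pages are moving to or from swap and one of the Java heaps should shrink.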

 

Tim Freeman
Email: tim.freeman@hp.com
Desk in Palo Alto: (650) 857-2581
Home: (408) 774-1298
Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)

 

From: Chris Were [mailto:chris.were@gmail.com]
Sent: Monday, November 16, 2009 9:47 AM
To: Jonathan Ellis
Cc: cassandra-user@incubator.apache.org
Subject: Re: Timeout Exception

 

I turned on debug logging for a few days and timeouts happened across pretty much all requests. I couldn't see any particular request that was consistently the problem.

 

After some experimenting it seems that shutting down cassandra and restarting resolves the problem. Once it hits the JVM memory limit, however, the timeouts start again. I have read the page on MemTable thresholds and have tried thresholds of 32MB, 64MB and 128MB with no noticeable difference. Cassandra is set to use 7GB of memory. I have 12 CFs, but only 6 of those have lots of data.

 

Cheers,

Chris

On Tue, Nov 10, 2009 at 11:55 AM, Jonathan Ellis <jbellis@gmail.com> wrote:

if you're timing out doing a slice on 10 columns w/ 10% cpu used,
something is broken

is it consistent as to which keys this happens on?  try turning on
debug logging and seeing where the latency is coming from.


On Tue, Nov 10, 2009 at 1:53 PM, Chris Were <chris.were@gmail.com> wrote:
>
> On Tue, Nov 10, 2009 at 11:50 AM, Jonathan Ellis <jbellis@gmail.com> wrote:
>>
>> On Tue, Nov 10, 2009 at 1:49 PM, Chris Were <chris.were@gmail.com> wrote:
>> > Maybe... but it's not just multigets, it also happens when retrieving
>> > one row with get_slice.
>>
>> how many of the 3M columns are you trying to slice at once?
>
> Sorry, I must have mixed up the terminology.
> There's ~3M keys, but less than 10 columns in each. The get_slice calls are
> to retrieve all the columns (10) for a given key.

 

--_000_59DD1BA8FD3C0F4C90771C18F2B5B53A4C842D4E4AGVW0432EXBame_--