From: Anuj Wadehra
To: user@cassandra.apache.org
Date: Thu, 18 Feb 2016 17:22:37 +0000 (UTC)
Subject: Re: Debugging write timeouts on Cassandra 2.2.5

What's the GC overhead? Can you share your GC collector and settings?

What's your query pattern? Do you use secondary indexes, batches, IN clauses, etc.?


Anuj
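A minimal way to gather that information on a typical package install, assuming the default log location and a stock startup script, is something like:

    # Show the GC-related flags the Cassandra JVM was started with
    ps -ef | grep '[C]assandraDaemon' | tr ' ' '\n' | grep -e '^-XX' -e '^-Xm'

    # Rough view of GC overhead: pause durations reported by Cassandra itself
    grep -i 'GCInspector' /var/log/cassandra/system.log | tail -n 20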



On Thu, 18 Feb, 2016 at 8:45 pm, Mike Heffner
<mike@librato.com> wrote:
Alain,

Thanks for the suggestions.

Sure, tpstats are here: https://gist.github.com/mheffner/a979ae1a0304480b052a. Looking at the metrics across the ring, there were no blocked tasks nor dropped messages.

Iowait metrics look fine, so it doesn't appear to be blocking on disk. Similarly, there are no long GC pauses.
We haven't noticed latency on any particular table higher than others or correlated around the occurrence of a timeout. We have noticed with further testing that running cassandra-stress against the ring, while our workload is writing to the same ring, will incur similar 10 second timeouts. If our workload is not writing to the ring, cassandra-stress will run without hitting timeouts. This seems to imply that our workload pattern is causing something to block cluster-wide, since the stress tool writes to a different keyspace than our workload.
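For reference, a sketch of that kind of side-by-side run with the 2.1+ stress tool; the host list, consistency level and thread count below are illustrative, not the exact invocation used:

    # Stress writes go to the tool's own keyspace (keyspace1 by default)
    # while the application keeps writing to its own keyspace
    cassandra-stress write n=1000000 cl=QUORUM -rate threads=50 \
        -node 10.0.0.1,10.0.0.2,10.0.0.3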

I mentioned in another reply that we've tracked it to something between 2.0.x and 2.1.x, so we are focusing on narrowing which point release it was introduced in.

Cheers,

Mike

On Thu, Feb 18, 2016 at 3:33 AM, Alain RODRIGUEZ <arodrime@gmail.com> wrote:
Hi Mike,

What about the output of tpstats ? I imagine you have dropped messages there. Any blocked threads ? Could you paste this output here ?

May this be due to some network hiccup to access the disks as they are EBS ? Can you think of any way of checking this ? Do you have a lot of GC logs, how long are the pauses (use something like: grep -i 'GCInspector' /var/log/cassandra/system.log) ?

Something else you could check are local_writes stats to see if only one table is affected or this is keyspace / cluster wide. You can use metrics exposed by Cassandra or, if you have no dashboards, I believe a: 'nodetool cfstats <myks> | grep -e 'Table:' -e 'Local'' should give you a rough idea of local latencies.
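A rough consolidation of these checks across the ring; the host list and keyspace name are placeholders:

    ks=my_keyspace                              # placeholder keyspace name
    for h in 10.0.0.1 10.0.0.2 10.0.0.3; do     # placeholder node addresses
      echo "== $h =="
      nodetool -h "$h" tpstats                  # blocked threads / dropped messages
      nodetool -h "$h" cfstats "$ks" | grep -e 'Table:' -e 'Local'   # per-table local latencies
    done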

Those are just things I would check, I have not a clue on what is happening here, hope this will help.
C*heers,
-----------------
Alain Rodriguez
France

The Last Pickle
http://www.thelastpickle.com
2016-02-18 5:13 GMT+01:00 Mike Heffner <mike@librato.com>:
Jaydeep,

No, we don't use any light weight transactions.

Mike

On Wed, Feb 17, 2016 at 6:44 PM, Jaydeep Chovatia <chovatia.jaydeep@gmail.com> wrote:
Are you guys using light weight transactions in your write path?

On Thu, Feb 11, 2016 at 12:36 AM, Fabrice Facorat <fabrice.facorat@gmail.com> wrote:
Are your commitlog and data on the same disk ? If yes, you should put
commitlogs on a separate disk which doesn't have a lot of IO.

Other IO may have a great impact on your commitlog writing and
it may even block.

An example of the impact IO may have, even for async writes:
https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic
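A quick way to check this, assuming the default directory layout (substitute the data_file_directories and commitlog_directory values from cassandra.yaml):

    # Do the commitlog and data directories resolve to the same block device?
    df -h /var/lib/cassandra/data /var/lib/cassandra/commitlog

    # Watch utilization and await on the underlying device while writes run
    iostat -x 5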

2016-02-11 0:31 GMT+01:00 Mike Heffner <mike@librato.com>:
> Jeff,
>
> We have both commitlog and data on a 4TB EBS with 10k IOPS.
>
> Mike
>
> On Wed, Feb 10, 2016 at 5:28 PM, Jeff Jirsa <jeff.jirsa@crowdstrike.com>
> wrote:
>>
>> What disk size are you using?
>>
>>
>>
>> From: Mike Heffner
>> Reply-To: "user@cassandra.apache.org"
>> Date: Wednesday, February 10, 2016 at 2:24 PM
>> To: "user@cassandra.apache.org"
>> Cc: Peter Norton
>> Subject: Re: Debugging write timeouts on Cassandra 2.2.5
>>
>> Paulo,
>>
>> Thanks for the suggestion, we ran some tests against CMS and saw the same
>> timeouts. On that note though, we are going to try doubling the instance
>> sizes and testing with double the heap (even though current usage is low).
>>
>> Mike
>>
>> On Wed, Feb 10, 2016 at 3:40 PM, Paulo Motta <pauloricardomg@gmail.com>
>> wrote:
>>>
>>> Are you using the same GC settings as the staging 2.0 cluster? If not,
>>> could you try using the default GC settings (CMS) and see if that changes
>>> anything? This is just a wild guess, but there were reports before of
>>> G1-caused instabilities with small heap sizes (< 16GB - see CASSANDRA-10403
>>> for more context). Please ignore if you already tried reverting back to CMS.
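For reference, reverting to the stock CMS collector is roughly a matter of restoring the default flags wherever the JVM options are set (cassandra-env.sh on a stock 2.2 install, or the upstart config mentioned elsewhere in this thread); the values below are the long-standing defaults, not a tuned set:

    # cassandra-env.sh fragment: stock CMS settings instead of G1
    JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
    JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
    JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
    JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
    JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
    JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
    JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"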
>>>
>>> 2016-02-10 16:51 GMT-03:00 Mike Heffner <mike@librato.com>:
>>>>
>>>> Hi all,
>>>>
>>>> We've recently embarked on a project to update our Cassandra
>>>> infrastructure running on EC2. We are long time users of 2.0.x and are
>>>> testing out a move to version 2.2.5 running on VPC with EBS. Our test setup
>>>> is a 3 node, RF=3 cluster supporting a small write load (mirror of our
>>>> staging load).
>>>>
>>>> We are writing at QUORUM and while p95's look good compared to our
>>>> staging 2.0.x cluster, we are seeing frequent write operations that time out
>>>> at the max write_request_timeout_in_ms (10 seconds). CPU across the cluster
>>>> is < 10% and EBS write load is < 100 IOPS. Cassandra is running with the
>>>> Oracle JDK 8u60 and we're using G1GC and any GC pauses are less than 500ms.
>>>>
>>>> We run on c4.2xl instances with GP2 EBS attached storage for data and
>>>> commitlog directories. The nodes are using EC2 enhanced networking and have
>>>> the latest Intel network driver module. We are running on HVM instances
>>>> using Ubuntu 14.04.2.
>>>>
>>>> Our schema is 5 tables, all with COMPACT STORAGE. Each table is similar
>>>> to the definition here:
>>>> https://gist.github.com/mheffner/4d80f6b53ccaa24cc20a
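Purely as an illustration of the shape (the real definitions are in the gist above), a COMPACT STORAGE table on 2.x generally looks like the hypothetical example below; the keyspace, table and column names are made up:

    # Hypothetical example only; the actual tables are in the linked gist
    cqlsh -e "
    CREATE TABLE metrics.samples (
        key     text,
        column1 bigint,
        value   blob,
        PRIMARY KEY (key, column1)
    ) WITH COMPACT STORAGE;"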
>>>>
>>>> This is our cassandra.yaml:
>>>> https://gist.github.com/mheffner/fea80e6e939dd483f94f#file-cassandra-yaml
>>>>
>>>> Like I mentioned we use 8u60 with G1GC and have used many of the GC
>>>> settings in Al Tobey's tuning guide. This is our upstart config with JVM and
>>>> other CPU settings: https://gist.github.com/mheffner/dc44613620b25c4fa46d
>>>>
>>>> We've used several of the sysctl settings from Al's guide as well:
>>>> https://gist.github.com/mheffner/ea40d58f58a517028152
>>>>
>>>> Our client application is able to write using either Thrift batches
>>>> using the Astyanax driver or CQL async INSERTs using the Datastax Java driver.
>>>>
>>>> For testing against Thrift (our legacy infra uses this) we write batches
>>>> of anywhere from 6 to 1500 rows at a time. Our p99 for batch execution is
>>>> around 45ms but our maximum (p100) sits less than 150ms except when it
>>>> periodically spikes to the full 10 seconds.
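One way to see that two-band latency distribution directly from Cassandra's own histograms; the keyspace and table names are placeholders:

    # Coordinator-level read/write latency percentiles, including the max
    nodetool proxyhistograms

    # Per-table local latency percentiles (placeholder keyspace/table)
    nodetool cfhistograms my_keyspace my_table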
>>>>
>>>> Testing the same write path using CQL writes instead demonstrates
>>>> similar behavior. Low p99s except for periodic full timeouts. We enabled
>>>> tracing for several operations but were unable to get a trace that completed
>>>> successfully -- Cassandra started logging many messages as:
>>>>
>>>> INFO  [ScheduledTasks:1] - MessagingService.java:946 - _TRACE messages
>>>> were dropped in last 5000 ms: 52499 for internal timeout and 0 for cross
>>>> node timeout
>>>>
>>>> And all the traces contained rows with a "null" source_elapsed row:
>>>> https://gist.githubusercontent.com/mheffner/1d68a70449bd6688a010/raw/0327d7d3d94c3a93af02b64212e3b7e7d8f2911b/trace.out
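If individual traces keep getting dropped, a lighter-weight alternative is probabilistic tracing, roughly:

    # Sample ~0.1% of all requests on each node instead of tracing single queries
    nodetool settraceprobability 0.001

    # Later, inspect the collected sessions via cqlsh
    cqlsh -e "SELECT session_id, duration, started_at FROM system_traces.sessions LIMIT 20;"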
>>>>
>>>>
>>>> We've exhausted as many configuration option permutations as we can
>>>> think of. This cluster does not appear to be under any significant load and
>>>> latencies seem to largely fall in two bands: low normal or max timeout. This
>>>> seems to imply that something is getting stuck and timing out at the max
>>>> write timeout.
>>>>
>>>> Any suggestions on what to look for? We had debug enabled for a while but
>>>> we didn't see any msg that pointed to something obvious. Happy to provide
>>>> any more information that may help.
>>>>
>>>> We are pretty much at the point of sprinkling debug around the code to
>>>> track down what could be blocking.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Mike
>>>>
>>>> --
>>>>
>>>>   Mike Heffner <mike@librato.com>
>>>>   Librato, Inc.
>>>>
>>>
>>
>>
>>
>> --
>>
>>   Mike Heffner <mike@librato.com>
>>   Librato, Inc.
>>
>
>
>
> --
>
>   Mike Heffner <mike@librato.com>
>   Librato, Inc.
>



--
Close the World, Open the Net
http://www.linux-wizard.net



--

  Mike Heffner <mike@librato.com>
  Librato, Inc.




--

  Mike Heffner <mike@librato.com>
  Librato, Inc.
