cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Keith Wright <kwri...@nanigans.com>
Subject Re: Hot, large row
Date Thu, 24 Jul 2014 20:00:12 GMT
I appreciate the feedback, doing it on the client side is interesting and I will start looking
into that.  To be clear, here are the symptoms I am seeing:


 *   A node will start showing high load and the CMS collection time jumps to 100+ ms/sec
(per new is also up)
 *   If I run nodetool tpstats, I see a high number of items in the Pending phase for ReadStage.
 Other items mostly appear near empty.  In addition, I see dropped reads.
 *   If I shutdown the effected node, two other nodes show increased load.  I am at RF=3 so
this would again suggest slow reads against a single row

Does it sound correct that my best course of action is to investigate a large row?  If it
were a small row being called repeatedly, I assume OS/key cache would make that a VERY fast
operation.

Thanks

From: DuyHai Doan <doanduyhai@gmail.com<mailto:doanduyhai@gmail.com>>
Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Date: Thursday, July 24, 2014 at 3:53 PM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Subject: Re: Hot, large row

Your extract of cfhistograms show that there are no particular "wide rows". The widest has
61214 cells which is big but not that huge to be really a concern.

Turning on trace probabilty only tells give you some "hints" about what kind of queries are
done, it does not give the exact partition key nor other statement values, especially when
you are using prepared statements ...


"I am considering reducing read_request_timeout_in_ms: 5000 in cassandra.yaml so that it reduces
the impact when this occurs." --> Don't do that, you'll only sweep dust under the carpet.
Find the real issue and fix it instead of changing parameter to hide it.

One solution would be on client side, to activate some logging to show the CQL3 statements
the application is issuing that may overload the server.  I know that's better said than done
but I don't have any other idea for the moment

-------- Shameless self-promotion ------

To support this kind of live prod debugging & investigation that I add a new dynamic query
logging feature in Achilles: https://github.com/doanduyhai/Achilles/wiki/Statements-Logging-and-Tracing#dynamic-statements-logging

Once you hit the issue, this kind of feature may save your day...






On Thu, Jul 24, 2014 at 9:22 PM, Keith Wright <kwright@nanigans.com<mailto:kwright@nanigans.com>>
wrote:
I can see from cfhistograms that I do have some wide rows (see below).  I set trace probability
as you suggested but the output doesn’t appear to tell me what row was actually read unless
I missed something.  I just see executing prepared statement.   Any ideas how I can find the
row in question?

I am considering reducing read_request_timeout_in_ms: 5000 in cassandra.yaml so that it reduces
the impact when this occurs.

Any help in identifying my issue would be GREATLY appreciated


Cell Count per Partition

    1 cells: 50449950

    2 cells: 14281828

    3 cells: 8093366

    4 cells: 5029200

    5 cells: 3103023<tel:3103023>

    6 cells: 3059903

    7 cells: 1903018

    8 cells: 1509297

   10 cells: 2420359

   12 cells: 1624895

   14 cells: 1171678

   17 cells: 1289391

   20 cells: 909777

   24 cells: 852081

   29 cells: 722925

   35 cells: 587067

   42 cells: 459473

   50 cells: 358744

   60 cells: 304146

   72 cells: 244682

   86 cells: 191045

  103 cells: 155337

  124 cells: 127061

  149 cells: 98913

  179 cells: 77454

  215 cells: 59849

  258 cells: 46117

  310 cells: 35321

  372 cells: 26319

  446 cells: 19379

  535 cells: 13783

  642 cells: 9993

  770 cells: 6973

  924 cells: 4713

 1109 cells: 3229

 1331 cells: 2062

 1597 cells: 1338

 1916 cells: 773

 2299 cells: 495

 2759 cells: 268

 3311 cells: 150

 3973 cells: 100

 4768 cells: 42

 5722 cells: 24

 6866 cells: 12

 8239 cells: 9

 9887 cells: 3

11864 cells: 0

14237 cells: 5

17084 cells: 1

20501 cells: 0

24601 cells: 2

29521 cells: 0

35425 cells: 0

42510 cells: 0

51012 cells: 0

61214 cells: 2

From: DuyHai Doan <doanduyhai@gmail.com<mailto:doanduyhai@gmail.com>>
Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Date: Thursday, July 24, 2014 at 3:01 PM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" <user@cassandra.apache.org<mailto:user@cassandra.apache.org>>
Subject: Re: Hot, large row

"How can I detect wide rows?" -->

nodetool cfhistograms <keyspace> <suspected column family>

Look at column "Column count" (last column) and identify a line in this column with very high
value of "Offset". In a well designed application you should have a gaussian distribution
where 80% of your row have a similar number of columns.

"Anyone know what debug level I can set so that I can see what reads the hot node is handling?
 " -->

"nodetool settraceprobability <value>",  where value is a small number (0.001) on the
node where you encounter the issue. Activate the tracing for a while (5 mins) and deactivate
it (value = 0). Then look into system_traces tables "events" & "sessions". It may help
or not since the tracing is done once every 1000.

"Any way to get the server to blacklist these wide rows automatically?" --> No


On Thu, Jul 24, 2014 at 8:48 PM, Keith Wright <kwright@nanigans.com<mailto:kwright@nanigans.com>>
wrote:
Hi all,

   We are seeing an issue where basically daily one of our nodes spikes in load and is churning
in CMS heap pressure.  It appears that reads are backing up and my guess is that our application
is reading a large row repeatedly.  Our write structure can lead itself to wide rows very
infrequently (<0.001%) and we do our best to detect and delete them but obviously we’re
missing a case.  Hoping for assistance on the following questions:

 *   How can I detect wide rows?
 *   Anyone know what debug level I can set so that I can see what reads the hot node is handling?
 I’m hoping to see the “bad” row
 *   Any way to get the server to blacklist these wide rows automatically?

We’re using C* 2.0.6 with Vnodes.

Thanks



Mime
View raw message