cassandra-user mailing list archives

From "Thakrar, Jayesh" <jthak...@conversantmedia.com>
Subject Re:
Date Fri, 29 Sep 2017 19:31:43 GMT
A couple of concerns, Dan:
- Such a large memtable size implies potentially longer GC pauses. Besides the pauses themselves, since memtables are what make writes fast in Cassandra, you may also see your write throughput suffer occasionally/intermittently.
- Big memtable flushes may also put pressure on your I/O, and unless you have spread that load across all 12 of your drives, flushing can become a bottleneck, especially if your drives are not SSDs.

If you can reproduce this problem (or if it doesn't go away), you might also want to look at your memtable activity, e.g. using nodetool tablestats.
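
For example, something along these lines would do it (mykeyspace.mytable is a placeholder), and watching per-drive I/O alongside can confirm whether flushes are saturating a single JBOD disk:

$ nodetool tablestats mykeyspace.mytable | grep -i memtable
$ watch -n 5 'nodetool tpstats | grep -i -e "^Pool" -e flush'
$ iostat -x 5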

From: Dan Kinder <dkinder@turnitin.com>
Date: Friday, September 29, 2017 at 1:57 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Cc: "Thakrar, Jayesh" <jthakrar@conversantmedia.com>
Subject: Re:

There are 9 non-system tables.

This machine has 64GB of memory and the memtable heap is deliberately high to reduce compaction
overhead.

So far I'm hopeful: I am seeing much higher resource usage from the nodes now (closer to expected).

On Fri, Sep 29, 2017 at 11:34 AM, Jeff Jirsa <jjirsa@gmail.com> wrote:
Aside from that, a 16GB memtable is huge and probably creating a ton of objects in heap. How many tables are in your cluster?

--
Jeff Jirsa


On Sep 29, 2017, at 11:18 AM, Dan Kinder <dkinder@turnitin.com> wrote:
I have a theory I'm testing out, after looking at the thread stack. Here it is:

Carried over from before the upgrade, I have:
memtable_heap_space_in_mb: 16384 # 16GB
memtable_cleanup_threshold: 0.11
memtable_allocation_type: heap_buffers

But since I'm on a JBOD with 12 disks, I'm letting memtable_flush_writers default to 2 per disk. That would calculate a default cleanup threshold of 1/(24+1) = 0.04, which is very different from 0.11.
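
Assuming I'm reading the yaml docs right that the default is 1 / (memtable_flush_writers + 1), a quick sanity check of the arithmetic:

$ awk 'BEGIN { printf "default memtable_cleanup_threshold = %.3f\n", 1 / (12 * 2 + 1) }'
default memtable_cleanup_threshold = 0.040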

Perhaps something about this configuration makes it lock up trying to allocate memtables?

On Fri, Sep 29, 2017 at 10:35 AM, Thakrar, Jayesh <jthakrar@conversantmedia.com> wrote:
This looks like a potential disk I/O issue.

See https://www.pythian.com/blog/guide-to-cassandra-thread-pools/


From: Dan Kinder <dkinder@turnitin.com>
Date: Friday, September 29, 2017 at 11:47 AM
To: <user@cassandra.apache.org>
Subject: Re:

Attached is a jstack thread dump from a node that's having trouble (see the tpstats paste below as well).


Pool Name                         Active   Pending      Completed   Blocked  All time blocked
ReadStage                              0         0         409111         0                 0
MiscStage                              0         0              0         0                 0
CompactionExecutor                    10        10           2481         0                 0
MutationStage                         32   1572402       31515477         0                 0
GossipStage                            1      1620         180091         0                 0
RequestResponseStage                   0         0       17509006         0                 0
ReadRepairStage                        0         0           4833         0                 0
CounterMutationStage                   0         0              0         0                 0
MemtableFlushWriter                    1        44             74         0                 0
MemtablePostFlush                      1       281            158         0                 0
ValidationExecutor                     0         0              0         0                 0
ViewMutationStage                      0         0              0         0                 0
CacheCleanupExecutor                   0         0              0         0                 0
PerDiskMemtableFlushWriter_10          0         0             67         0                 0
PerDiskMemtableFlushWriter_11          0         0             67         0                 0
MemtableReclaimMemory                  0         0             79         0                 0
PendingRangeCalculator                 0         0             19         0                 0
SecondaryIndexManagement               0         0              0         0                 0
HintsDispatcher                        1         1           1456         0                 0
PerDiskMemtableFlushWriter_1           0         0             67         0                 0
Native-Transport-Requests             69         0       17951984         0             26956
PerDiskMemtableFlushWriter_2           0         0             67         0                 0
MigrationStage                         1       201           1048         0                 0
PerDiskMemtableFlushWriter_0           0         0             78         0                 0
Sampler                                0         0              0         0                 0
PerDiskMemtableFlushWriter_5           0         0             67         0                 0
InternalResponseStage                  0         0        2754163         0                 0
PerDiskMemtableFlushWriter_6           0         0             67         0                 0
PerDiskMemtableFlushWriter_3           0         0             67         0                 0
PerDiskMemtableFlushWriter_4           0         0             67         0                 0
PerDiskMemtableFlushWriter_9           0         0             67         0                 0
AntiEntropyStage                       0         0              0         0                 0
PerDiskMemtableFlushWriter_7           0         0             67         0                 0
PerDiskMemtableFlushWriter_8           0         0             67         0                 0


Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
HINT                         0
MUTATION                     0
COUNTER_MUTATION             0
BATCH_STORE                  0
BATCH_REMOVE                 0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                  0

On Thu, Sep 28, 2017 at 2:27 PM, Dan Kinder <dkinder@turnitin.com> wrote:
Unless I'm reading the logs wrong, it doesn't seem like the pauses are very long... I've attached the full log as well, covering everything since the node restart.

$ grep -i "pause\|times" /var/log/cassandra/gc.log.0.current | tail -n 40
2017-09-28T14:18:48.324-0700: 12866.695: [GC pause (G1 Evacuation Pause) (young)
 [Times: user=0.33 sys=0.00, real=0.02 secs]
2017-09-28T14:18:58.198-0700: 12876.569: [GC pause (G1 Evacuation Pause) (young)
 [Times: user=0.34 sys=0.00, real=0.03 secs]
2017-09-28T14:19:08.039-0700: 12886.411: [GC pause (G1 Evacuation Pause) (young)
 [Times: user=0.30 sys=0.00, real=0.03 secs]
2017-09-28T14:19:17.900-0700: 12896.271: [GC pause (G1 Evacuation Pause) (young)
 [Times: user=0.33 sys=0.00, real=0.03 secs]
2017-09-28T14:19:28.995-0700: 12907.366: [GC pause (G1 Evacuation Pause) (young)
 [Times: user=0.31 sys=0.00, real=0.02 secs]
2017-09-28T14:19:40.576-0700: 12918.947: [GC pause (G1 Evacuation Pause) (young)
 [Times: user=0.31 sys=0.01, real=0.02 secs]
2017-09-28T14:19:52.164-0700: 12930.535: [GC pause (G1 Evacuation Pause) (young)
 [Times: user=0.32 sys=0.01, real=0.02 secs]
2017-09-28T14:20:03.813-0700: 12942.184: [GC pause (G1 Evacuation Pause) (young)
 [Times: user=0.33 sys=0.00, real=0.02 secs]
2017-09-28T14:20:15.584-0700: 12953.955: [GC pause (G1 Evacuation Pause) (young)
 [Times: user=0.35 sys=0.00, real=0.02 secs]
2017-09-28T14:20:27.261-0700: 12965.632: [GC pause (G1 Evacuation Pause) (young)
 [Times: user=0.34 sys=0.00, real=0.03 secs]
2017-09-28T14:20:38.894-0700: 12977.265: [GC pause (G1 Evacuation Pause) (young)
 [Times: user=0.34 sys=0.01, real=0.03 secs]
2017-09-28T14:20:50.685-0700: 12989.056: [GC pause (G1 Evacuation Pause) (young)
 [Times: user=0.33 sys=0.00, real=0.03 secs]
2017-09-28T14:21:02.281-0700: 13000.652: [GC pause (G1 Evacuation Pause) (young)
 [Times: user=0.31 sys=0.00, real=0.02 secs]
2017-09-28T14:21:13.975-0700: 13012.346: [GC pause (G1 Evacuation Pause) (young)
 [Times: user=0.32 sys=0.00, real=0.02 secs]
2017-09-28T14:21:25.797-0700: 13024.168: [GC pause (G1 Evacuation Pause) (young)
 [Times: user=0.32 sys=0.01, real=0.02 secs]
2017-09-28T14:21:37.395-0700: 13035.766: [GC pause (G1 Evacuation Pause) (young)
 [Times: user=0.33 sys=0.00, real=0.02 secs]
2017-09-28T14:21:49.166-0700: 13047.537: [GC pause (G1 Evacuation Pause) (young)
 [Times: user=0.32 sys=0.00, real=0.02 secs]
2017-09-28T14:22:00.887-0700: 13059.258: [GC pause (G1 Evacuation Pause) (young)
 [Times: user=0.33 sys=0.00, real=0.03 secs]
2017-09-28T14:22:12.453-0700: 13070.824: [GC pause (G1 Evacuation Pause) (young)
 [Times: user=0.34 sys=0.00, real=0.02 secs]
2017-09-28T14:22:24.023-0700: 13082.394: [GC pause (G1 Evacuation Pause) (young)
 [Times: user=0.35 sys=0.00, real=0.02 secs]


On Thu, Sep 28, 2017 at 2:13 PM, Jeff Jirsa <jjirsa@gmail.com> wrote:
The digest mismatch exception is not a problem; that's why it's only logged at debug.

As Thomas noted, there's a pretty good chance this is https://issues.apache.org/jira/browse/CASSANDRA-13754 - if you see a lot of GCInspector logs indicating GC pauses, that would add confidence to that diagnosis.
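
E.g. something like this against the system log should surface them, if they're happening (log path may differ per install):

$ grep -i GCInspector /var/log/cassandra/system.log | tail -n 20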


On Thu, Sep 28, 2017 at 2:08 PM, Dan Kinder <dkinder@turnitin.com> wrote:
Thanks for the responses.

@Prem yes, this is after the entire cluster is on 3.11, but no, I have not run upgradesstables yet.

@Thomas no I don't see any major GC going on.

@Jeff yeah, it's fully upgraded. I decided to shut the whole thing down and bring it back up (thankfully this cluster is not serving live traffic). The nodes seemed okay for an hour or two, but then I saw the issue again, without me bouncing any nodes. This time it's ReadStage that's building up, and the exception I'm seeing in the logs is:

DEBUG [ReadRepairStage:106] 2017-09-28 13:01:37,206 ReadCallback.java:242 - Digest mismatch:
org.apache.cassandra.service.DigestMismatchException: Mismatch for key DecoratedKey(6150926370328526396, 696a6374652e6f7267) (2f0fffe2d743cdc4c69c3eb351a3c9ca vs 00ee661ae190c2cbf0eb2fb8a51f6025)
        at org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:233) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_71]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_71]
        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) [apache-cassandra-3.11.0.jar:3.11.0]
        at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_71]



Do you think running upgradesstables would help? Or relocatesstables? I presumed neither should be necessary for Cassandra to function, just an optimization. If so, I'd run something like the below.
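
One keyspace at a time, single-threaded, to limit the extra compaction pressure (keyspace name is a placeholder):

$ nodetool upgradesstables -j 1 mykeyspace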

On Thu, Sep 28, 2017 at 12:49 PM, Steinmaurer, Thomas <thomas.steinmaurer@dynatrace.com> wrote:
Dan,

do you see any major GC? We have been hit by the following memory leak in our loadtest environment
with 3.11.0.
https://issues.apache.org/jira/browse/CASSANDRA-13754

So, depending on the heap size and uptime, you might get into heap troubles.
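
If you want to watch heap occupancy and GC activity over time, something like jstat against the Cassandra PID works (the pgrep pattern is just one way to find it):

$ jstat -gcutil $(pgrep -f CassandraDaemon) 5000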

Thomas

From: Dan Kinder <dkinder@turnitin.com>
Sent: Thursday, 28 September 2017 18:20
To: user@cassandra.apache.org
Subject:


Hi,

I recently upgraded our 16-node cluster from 2.2.6 to 3.11 and am seeing the following: the cluster does function, for a while, but then some stages begin to back up, and the node does not recover and does not drain the tasks, even under no load. This happens to both MutationStage and GossipStage.

I do see the following exception happen in the logs:



ERROR [ReadRepairStage:2328] 2017-09-26 23:07:55,440 CassandraDaemon.java:228 - Exception in thread Thread[ReadRepairStage:2328,5,main]
org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 1 responses.
        at org.apache.cassandra.service.DataResolver$RepairMergeListener.close(DataResolver.java:171) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$2.close(UnfilteredPartitionIterators.java:182) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.db.transform.BaseIterator.close(BaseIterator.java:82) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.service.DataResolver.compareResponses(DataResolver.java:89) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:50) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_91]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_91]
        at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) ~[apache-cassandra-3.11.0.jar:3.11.0]
        at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_91]



But it's hard to correlate this precisely with things going bad. It is also very strange to me, since I have both read_repair_chance and dclocal_read_repair_chance set to 0.0 for ALL of my tables, so it is confusing why ReadRepairStage would error at all.
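
A query like this against system_schema should confirm those settings (keyspace name is a placeholder):

$ cqlsh -e "SELECT table_name, read_repair_chance, dclocal_read_repair_chance FROM system_schema.tables WHERE keyspace_name = 'mykeyspace';"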

Anyone have thoughts on this? It's pretty muddling, and it causes nodes to lock up. Once it happens, Cassandra can't even shut down; I have to kill -9. If I can't find a resolution I'm going to need to downgrade and restore from backup...

The only issue I found that looked similar is https://issues.apache.org/jira/browse/CASSANDRA-12689
but that appears to be fixed by 3.10.



$ nodetool tpstats

Pool Name                         Active   Pending      Completed   Blocked  All time blocked
ReadStage                              0         0         582103         0                 0
MiscStage                              0         0              0         0                 0
CompactionExecutor                    11        11           2868         0                 0
MutationStage                         32   4593678       55057393         0                 0
GossipStage                            1      2818         371487         0                 0
RequestResponseStage                   0         0        4345522         0                 0
ReadRepairStage                        0         0         151473         0                 0
CounterMutationStage                   0         0              0         0                 0
MemtableFlushWriter                    1        81             76         0                 0
MemtablePostFlush                      1       382            139         0                 0
ValidationExecutor                     0         0              0         0                 0
ViewMutationStage                      0         0              0         0                 0
CacheCleanupExecutor                   0         0              0         0                 0
PerDiskMemtableFlushWriter_10          0         0             69         0                 0
PerDiskMemtableFlushWriter_11          0         0             69         0                 0
MemtableReclaimMemory                  0         0             81         0                 0
PendingRangeCalculator                 0         0             32         0                 0
SecondaryIndexManagement               0         0              0         0                 0
HintsDispatcher                        0         0            596         0                 0
PerDiskMemtableFlushWriter_1           0         0             69         0                 0
Native-Transport-Requests             11         0        4547746         0                67
PerDiskMemtableFlushWriter_2           0         0             69         0                 0
MigrationStage                         1      1545            586         0                 0
PerDiskMemtableFlushWriter_0           0         0             80         0                 0
Sampler                                0         0              0         0                 0
PerDiskMemtableFlushWriter_5           0         0             69         0                 0
InternalResponseStage                  0         0          45432         0                 0
PerDiskMemtableFlushWriter_6           0         0             69         0                 0
PerDiskMemtableFlushWriter_3           0         0             69         0                 0
PerDiskMemtableFlushWriter_4           0         0             69         0                 0
PerDiskMemtableFlushWriter_9           0         0             69         0                 0
AntiEntropyStage                       0         0              0         0                 0
PerDiskMemtableFlushWriter_7           0         0             69         0                 0
PerDiskMemtableFlushWriter_8           0         0             69         0                 0



Message type           Dropped
READ                         0
RANGE_SLICE                  0
_TRACE                       0
HINT                         0
MUTATION                     0
COUNTER_MUTATION             0
BATCH_STORE                  0
BATCH_REMOVE                 0
REQUEST_RESPONSE             0
PAGED_RANGE                  0
READ_REPAIR                  0


-dan



--
Dan Kinder
Principal Software Engineer
Turnitin – www.turnitin.com
dkinder@turnitin.com










