cassandra-commits mailing list archives

From "Randy Fradin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-12965) StreamReceiveTask causing high CPU utilization during repair
Date Tue, 18 Jul 2017 01:58:00 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090973#comment-16090973 ]

Randy Fradin commented on CASSANDRA-12965:
------------------------------------------

We set -Dcassandra.available_processors=(some number less than the number of cores on the host), as suggested by lieangsibin. This limits the sizes of several thread pools, including this one. It is not exactly a fix, but it at least prevents Cassandra from monopolizing all of the host's CPU resources.
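
For anyone wondering how the flag takes effect: a minimal sketch of the property lookup, modeled on org.apache.cassandra.utils.FBUtilities.getAvailableProcessors() (the real code may differ between Cassandra versions):
{noformat}
// Minimal sketch of the override, modeled on
// FBUtilities.getAvailableProcessors(); details vary by version.
public final class AvailableProcessors
{
    public static int get()
    {
        String override = System.getProperty("cassandra.available_processors");
        return override != null
             ? Integer.parseInt(override)                  // operator-supplied cap
             : Runtime.getRuntime().availableProcessors(); // default: every core
    }
}
{noformat}
Only thread pools that are sized through this helper (rather than querying Runtime directly) are constrained by the flag, which is why setting it caps several pools at once.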

> StreamReceiveTask causing high CPU utilization during repair
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-12965
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12965
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Randy Fradin
>
> During a full repair run, I observed one node in my cluster using 100% CPU (100% of all cores on a 48-core machine). When I took a stack trace, I found exactly 48 running StreamReceiveTask threads. Each was in the same block of code in StreamReceiveTask.OnCompletionRunnable:
> {noformat}
> "StreamReceiveTask:8077" #1511134 daemon prio=5 os_prio=0 tid=0x00007f01520a8800 nid=0x6e77
runnable [0x00007f020dfae000]
>    java.lang.Thread.State: RUNNABLE
>         at java.util.ComparableTimSort.binarySort(ComparableTimSort.java:258)
>         at java.util.ComparableTimSort.sort(ComparableTimSort.java:203)
>         at java.util.Arrays.sort(Arrays.java:1312)
>         at java.util.Arrays.sort(Arrays.java:1506)
>         at java.util.ArrayList.sort(ArrayList.java:1454)
>         at java.util.Collections.sort(Collections.java:141)
>         at org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:257)
>         at org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
>         at org.apache.cassandra.utils.IntervalTree.<init>(IntervalTree.java:72)
>         at org.apache.cassandra.db.DataTracker$SSTableIntervalTree.<init>(DataTracker.java:590)
>         at org.apache.cassandra.db.DataTracker$SSTableIntervalTree.<init>(DataTracker.java:584)
>         at org.apache.cassandra.db.DataTracker.buildIntervalTree(DataTracker.java:565)
>         at org.apache.cassandra.db.DataTracker$View.replace(DataTracker.java:761)
>         at org.apache.cassandra.db.DataTracker.addSSTablesToTracker(DataTracker.java:428)
>         at org.apache.cassandra.db.DataTracker.addSSTables(DataTracker.java:283)
>         at org.apache.cassandra.db.ColumnFamilyStore.addSSTables(ColumnFamilyStore.java:1422)
>         at org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:148)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> All 48 threads were in ColumnFamilyStore.addSSTables(), and specifically in the IntervalNode constructor called from the IntervalTree constructor.
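>
> For anyone triaging: the cost pattern fits a tree that is rebuilt from scratch on every SSTable addition, with a sort inside every node constructor. A hypothetical sketch of that shape of work (illustrative only, not Cassandra's actual IntervalTree code):
> {noformat}
> import java.util.ArrayList;
> import java.util.Collections;
> import java.util.List;
>
> // Hypothetical centered interval tree: every node constructor sorts its
> // sublist (the ComparableTimSort frames above), and the whole tree is
> // rebuilt whenever the SSTable set changes.
> final class Interval implements Comparable<Interval>
> {
>     final long min, max;
>     Interval(long min, long max) { this.min = min; this.max = max; }
>     public int compareTo(Interval o) { return Long.compare(min, o.min); }
> }
>
> final class IntervalNode
> {
>     final long center;
>     final List<Interval> overlapping = new ArrayList<>();
>     final IntervalNode left, right;
>
>     IntervalNode(List<Interval> intervals)
>     {
>         Collections.sort(intervals);            // a sort at every node
>         center = intervals.get(intervals.size() / 2).min;
>         List<Interval> lo = new ArrayList<>(), hi = new ArrayList<>();
>         for (Interval i : intervals)
>         {
>             if (i.max < center)      lo.add(i);       // entirely left of center
>             else if (i.min > center) hi.add(i);       // entirely right of center
>             else                     overlapping.add(i);
>         }
>         left  = lo.isEmpty() ? null : new IntervalNode(lo);
>         right = hi.isEmpty() ? null : new IntervalNode(hi);
>     }
> }
> {noformat}
> Rebuilding such a tree once per added SSTable would turn 20,000 small additions into 20,000 full rebuilds, which would explain sustained 100% CPU across many cores.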
> It stayed this way for maybe an hour before we restarted the node. The repair was also generating thousands (20,000+) of tiny SSTables in a table that previously had just 20.
> I don't know enough about SSTables and ColumnFamilyStore to know whether all this CPU work is necessary or a bug, but I did notice that these tasks run on a thread pool constructed in StreamReceiveTask.java, so perhaps this pool should have a maximum thread count lower than the number of processors on the machine, at least for machines with many processors. Any reason not to do that? Any ideas for a reasonable number or formula to cap the thread count?
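>
> A hypothetical sketch of the kind of cap I have in mind (the class name and the formula are illustrative, not Cassandra's actual code or a tested recommendation):
> {noformat}
> import java.util.concurrent.ExecutorService;
> import java.util.concurrent.Executors;
>
> // Hypothetical: size the stream-receive pool below the core count so
> // post-stream work cannot occupy every core. The divisor is illustrative.
> public final class CappedReceivePool
> {
>     public static ExecutorService create()
>     {
>         int cores = Runtime.getRuntime().availableProcessors();
>         int cap = Math.max(1, Math.min(cores - 1, cores / 4)); // 48 cores -> 12 threads
>         return Executors.newFixedThreadPool(cap);
>     }
> }
> {noformat}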
> Some additional info: We have never run incremental repair on this cluster, so that is not a factor. All our tables use LCS. Unfortunately I don't have the log files from the period saved.


