cassandra-commits mailing list archives

From "Randy Fradin (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CASSANDRA-12965) StreamReceiveTask causing high CPU utilization during repair
Date Mon, 28 Nov 2016 20:08:58 GMT
Randy Fradin created CASSANDRA-12965:
----------------------------------------

             Summary: StreamReceiveTask causing high CPU utilization during repair
                 Key: CASSANDRA-12965
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12965
             Project: Cassandra
          Issue Type: Bug
            Reporter: Randy Fradin


During a full repair run, I observed one node in my cluster using 100% CPU (100% of all cores
on a 48-core machine). When I took a stack trace, I found exactly 48 running StreamReceiveTask
threads. Each was in the same block of code in StreamReceiveTask.OnCompletionRunnable:
{noformat}
"StreamReceiveTask:8077" #1511134 daemon prio=5 os_prio=0 tid=0x00007f01520a8800 nid=0x6e77
runnable [0x00007f020dfae000]
   java.lang.Thread.State: RUNNABLE
        at java.util.ComparableTimSort.binarySort(ComparableTimSort.java:258)
        at java.util.ComparableTimSort.sort(ComparableTimSort.java:203)
        at java.util.Arrays.sort(Arrays.java:1312)
        at java.util.Arrays.sort(Arrays.java:1506)
        at java.util.ArrayList.sort(ArrayList.java:1454)
        at java.util.Collections.sort(Collections.java:141)
        at org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:257)
        at org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
        at org.apache.cassandra.utils.IntervalTree.<init>(IntervalTree.java:72)
        at org.apache.cassandra.db.DataTracker$SSTableIntervalTree.<init>(DataTracker.java:590)
        at org.apache.cassandra.db.DataTracker$SSTableIntervalTree.<init>(DataTracker.java:584)
        at org.apache.cassandra.db.DataTracker.buildIntervalTree(DataTracker.java:565)
        at org.apache.cassandra.db.DataTracker$View.replace(DataTracker.java:761)
        at org.apache.cassandra.db.DataTracker.addSSTablesToTracker(DataTracker.java:428)
        at org.apache.cassandra.db.DataTracker.addSSTables(DataTracker.java:283)
        at org.apache.cassandra.db.ColumnFamilyStore.addSSTables(ColumnFamilyStore.java:1422)
        at org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:148)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{noformat}

All 48 threads were in ColumnFamilyStore.addSSTables(), and specifically in the IntervalNode
constructor called from the IntervalTree constructor.

It stayed this way for maybe an hour before we restarted the node. The repair was also generating
thousands (20,000+) of tiny SSTables in a table that previously had just 20.
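
For context on why this shows up as pure CPU burn, here is a rough, hypothetical sketch (not Cassandra source; the class name and numbers are illustrative only) of the cost pattern the stack trace suggests: if the sstable interval tree is rebuilt from scratch, i.e. fully re-sorted, every time a streamed-in sstable is added, then getting to 20,000 sstables means re-sorting an ever-growing list thousands of times.
{noformat}
// Hypothetical illustration only (not Cassandra code): models rebuilding the
// whole sstable interval tree after every streamed-in sstable by re-sorting
// every interval already tracked, which is what the IntervalTree/IntervalNode
// frames above appear to be doing.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class RebuildCostSketch {
    public static void main(String[] args) {
        List<Long> intervals = new ArrayList<>();
        long start = System.nanoTime();
        for (int added = 0; added < 20_000; added++) {   // ~20k tiny sstables
            intervals.add(ThreadLocalRandom.current().nextLong());
            Collections.sort(intervals);                 // re-sort the whole list on every add
        }
        System.out.printf("20k incremental rebuilds took %d ms%n",
                          (System.nanoTime() - start) / 1_000_000);
    }
}
{noformat}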

I don't know enough about SSTables and ColumnFamilyStore to say whether all this CPU work is
necessary or a bug, but I did notice that these tasks run on a thread pool constructed in
StreamReceiveTask.java, so perhaps that pool should have a maximum thread count lower than the
number of processors on the machine, at least for machines with many processors. Any reason
not to do that? Any ideas for a reasonable number or formula to cap the thread count?
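
For illustration, a minimal sketch of the kind of cap I have in mind (the constant and class name are made up, not a proposal for the actual value or the real pool construction):
{noformat}
// Hedged sketch, not the actual StreamReceiveTask pool: caps the completion
// executor at min(cores, SOME_CAP) instead of one thread per core.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CappedReceivePoolSketch {
    private static final int SOME_CAP = 4;  // arbitrary placeholder, not a recommendation

    static final ExecutorService COMPLETION_POOL = Executors.newFixedThreadPool(
            Math.min(Runtime.getRuntime().availableProcessors(), SOME_CAP));
}
{noformat}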

Some additional info: We have never run incremental repair on this cluster, so that is not
a factor. All our tables use LCS. Unfortunately I don't have the log files from that period
saved.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
