cassandra-commits mailing list archives

From "liangsibin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-12965) StreamReceiveTask causing high CPU utilization during repair
Date Tue, 28 Feb 2017 09:22:45 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15887642#comment-15887642 ]

liangsibin commented on CASSANDRA-12965:
----------------------------------------

Maybe we can add -Dcassandra.available_processors=20 at Cassandra startup to lower the number of StreamReceiveTask threads.
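
For reference, that system property is passed to the JVM at startup; one common place to set it is conf/cassandra-env.sh. A minimal sketch using the value suggested above (the file path and the value 20 are illustrative, not a recommendation):

{noformat}
# conf/cassandra-env.sh -- cap the processor count Cassandra uses when
# sizing internal thread pools, including the stream receive pool
JVM_OPTS="$JVM_OPTS -Dcassandra.available_processors=20"
{noformat}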

> StreamReceiveTask causing high CPU utilization during repair
> ------------------------------------------------------------
>
>                 Key: CASSANDRA-12965
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12965
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Randy Fradin
>
> During a full repair run, I observed one node in my cluster using 100% CPU (100% of all cores on a 48-core machine). When I took a stack trace I found exactly 48 running StreamReceiveTask threads. Each was in the same block of code in StreamReceiveTask.OnCompletionRunnable:
> {noformat}
> "StreamReceiveTask:8077" #1511134 daemon prio=5 os_prio=0 tid=0x00007f01520a8800 nid=0x6e77
runnable [0x00007f020dfae000]
>    java.lang.Thread.State: RUNNABLE
>         at java.util.ComparableTimSort.binarySort(ComparableTimSort.java:258)
>         at java.util.ComparableTimSort.sort(ComparableTimSort.java:203)
>         at java.util.Arrays.sort(Arrays.java:1312)
>         at java.util.Arrays.sort(Arrays.java:1506)
>         at java.util.ArrayList.sort(ArrayList.java:1454)
>         at java.util.Collections.sort(Collections.java:141)
>         at org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:257)
>         at org.apache.cassandra.utils.IntervalTree$IntervalNode.<init>(IntervalTree.java:280)
>         at org.apache.cassandra.utils.IntervalTree.<init>(IntervalTree.java:72)
>         at org.apache.cassandra.db.DataTracker$SSTableIntervalTree.<init>(DataTracker.java:590)
>         at org.apache.cassandra.db.DataTracker$SSTableIntervalTree.<init>(DataTracker.java:584)
>         at org.apache.cassandra.db.DataTracker.buildIntervalTree(DataTracker.java:565)
>         at org.apache.cassandra.db.DataTracker$View.replace(DataTracker.java:761)
>         at org.apache.cassandra.db.DataTracker.addSSTablesToTracker(DataTracker.java:428)
>         at org.apache.cassandra.db.DataTracker.addSSTables(DataTracker.java:283)
>         at org.apache.cassandra.db.ColumnFamilyStore.addSSTables(ColumnFamilyStore.java:1422)
>         at org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:148)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> All 48 threads were in ColumnFamilyStore.addSSTables(), and specifically in the IntervalNode constructor called from the IntervalTree constructor.
> It stayed this way for maybe an hour before we restarted the node. The repair was also generating thousands (20,000+) of tiny SSTables in a table that previously had just 20.
> I don't know enough about SSTables and ColumnFamilyStore to know whether all this CPU work is necessary or a bug, but I did notice that these tasks are run on a thread pool constructed in StreamReceiveTask.java, so perhaps this pool should have a maximum thread count lower than the number of processors on the machine, at least for machines with a lot of processors. Any reason not to do that? Any ideas for a reasonable number or formula to cap the thread count?
> Some additional info: We have never run incremental repair on this cluster, so that is not a factor. All our tables use LCS. Unfortunately I don't have the log files from the period saved.
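
To put the observation above in concrete terms: the trace shows each StreamReceiveTask ending up in DataTracker.buildIntervalTree, which re-sorts the intervals of every live SSTable whenever a batch of received SSTables is added. A hypothetical, standalone Java sketch (not Cassandra code; the 20,000 figure is taken from the report) of how that work accumulates when tiny SSTables arrive one batch at a time:

{noformat}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Stand-in for the rebuild pattern in the stack trace: every batch of added
// sstables triggers a full rebuild, and the rebuild sorts all intervals
// currently live, so total sort work grows roughly quadratically in the
// number of tiny sstables streamed in.
public class RebuildCostSketch {
    public static void main(String[] args) {
        List<Long> liveSSTableTokens = new ArrayList<Long>();
        long sortedElements = 0;                      // total elements pushed through the sort
        for (long i = 0; i < 20_000; i++) {           // 20,000 tiny sstables
            liveSSTableTokens.add(i);                 // one small batch added
            Collections.sort(liveSSTableTokens);      // full re-sort on every rebuild
            sortedElements += liveSSTableTokens.size();
        }
        System.out.printf("%,d elements sorted in total%n", sortedElements); // ~200 million
    }
}
{noformat}

With 48 such tasks running at once, that repeated sorting alone is enough to keep every core busy, which matches the 100% CPU observation in the report.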
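
On the reporter's question about a reasonable cap for the receive pool: a minimal sketch of the idea, assuming a fixed ceiling (the class name and the constant 8 are placeholders, not the actual StreamReceiveTask pool code):

{noformat}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sizing formula: never let the pool scale past a fixed ceiling,
// regardless of how many cores the machine has.
public class StreamReceivePoolSketch {
    private static final int MAX_RECEIVE_THREADS = 8; // assumed ceiling

    public static ExecutorService newReceivePool() {
        int cores = Runtime.getRuntime().availableProcessors();
        return Executors.newFixedThreadPool(Math.min(cores, MAX_RECEIVE_THREADS));
    }
}
{noformat}

Setting -Dcassandra.available_processors, as suggested in the comment above, achieves a similar effect without a code change.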



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
