cassandra-commits mailing list archives

From "Benedict (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-7704) FileNotFoundException during STREAM-OUT triggers 100% CPU usage
Date Wed, 06 Aug 2014 11:41:11 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benedict updated CASSANDRA-7704:
--------------------------------

    Attachment: 7704.txt

Attaching a patch that I think addresses this. There are a number of concurrency bugs here,
and whilst we could fix them with more advanced lock-free techniques, there is no compelling
reason for this class not to use synchronized everywhere, which would probably have avoided
this problem in the first place. There is only one place where execution is not guaranteed to
be prompt, and I have left that out of the synchronization. At the same time I have simplified
the logic, fixed the logic for cancelling timeouts, and made the scheduled executor for timeouts
globally shared (there's no good reason to spin up a new executor for each set of transfers).
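
As a rough illustration of the shared-executor and timeout-cancellation points, here is a minimal sketch. It is not the code in 7704.txt: the class TransferTimeouts, its methods, and the thread name are hypothetical, and only java.util.concurrent calls from the JDK are used.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; names are assumptions, not Cassandra classes.
final class TransferTimeouts
{
    // One scheduler shared by every instance, instead of one executor per set of transfers.
    private static final ScheduledExecutorService TIMEOUTS =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "transfer-timeouts");
            t.setDaemon(true); // do not keep the JVM alive just for timeouts
            return t;
        });

    // Pending timeout per transfer id; guarded by this object's monitor.
    private final Map<Integer, ScheduledFuture<?>> pending = new HashMap<>();

    // Schedule a timeout for one transfer, cancelling any earlier timeout for the same id.
    synchronized void schedule(int id, Runnable onTimeout, long delay, TimeUnit unit)
    {
        ScheduledFuture<?> previous = pending.put(id, TIMEOUTS.schedule(onTimeout, delay, unit));
        if (previous != null)
            previous.cancel(false);
    }

    // Cancel the timeout once the transfer finishes; true only if a live timeout was cancelled.
    synchronized boolean cancel(int id)
    {
        ScheduledFuture<?> future = pending.remove(id);
        return future != null && future.cancel(false);
    }

    synchronized int pendingCount()
    {
        return pending.size();
    }
}
```

Because schedule() and cancel() share one monitor, a cancellation cannot interleave with a reschedule of the same transfer, and the static scheduler means the number of timer threads does not grow with the number of transfer sets.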

In this particular instance the issue seems to have been a lack of atomicity between abort()
and complete(); an ACK arrived at the same time as abort() was cancelling all transfers, causing
a reference to be released twice. This could also occur with the timeouts, but since they
occur only every 12 hours, the risk is low.
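
The race described above can be sketched as follows. This is a simplified model under stated assumptions, not the patched class: CountedRef and TransferSet are hypothetical names, and the reference is a plain counter standing in for whatever is actually released. The point is that once abort() and complete() share one monitor, whichever runs first removes the entry, so the other finds nothing to release.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a reference-counted resource; not a Cassandra class.
final class CountedRef
{
    private final AtomicInteger count = new AtomicInteger(1);

    void release()
    {
        // A second release would drive the count negative; fail loudly instead.
        if (count.decrementAndGet() < 0)
            throw new IllegalStateException("reference released twice");
    }

    int count()
    {
        return count.get();
    }
}

// Hypothetical model of a set of in-flight transfers keyed by sequence number.
final class TransferSet
{
    private final Map<Integer, CountedRef> inFlight = new HashMap<>();

    synchronized void add(int sequence, CountedRef ref)
    {
        inFlight.put(sequence, ref);
    }

    // Called when an ACK arrives: release only if the transfer is still tracked.
    synchronized void complete(int sequence)
    {
        CountedRef ref = inFlight.remove(sequence);
        if (ref != null)
            ref.release();
    }

    // Cancel everything still in flight; entries already completed are gone from the map.
    synchronized void abort()
    {
        for (CountedRef ref : inFlight.values())
            ref.release();
        inFlight.clear();
    }
}
```

Without the synchronized keyword, a thread in complete() and a thread in abort() could both observe the same entry before either removed it, and each would call release() on it.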

> FileNotFoundException during STREAM-OUT triggers 100% CPU usage
> ---------------------------------------------------------------
>
>                 Key: CASSANDRA-7704
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7704
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Rick Branson
>         Attachments: 7704.txt, backtrace.txt
>
>
> See attached backtrace which was what triggered this. This stream failed and then ~12
> seconds later it emitted that exception. At that point, all CPUs went to 100%. A thread dump
> shows all the ReadStage threads stuck inside IntervalTree.searchInternal inside of CFS.markReferenced().



--
This message was sent by Atlassian JIRA
(v6.2#6252)
