cassandra-commits mailing list archives

From "Li Zou (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-5932) Speculative read performance data show unexpected results
Date Mon, 30 Sep 2013 18:05:26 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-5932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13782027#comment-13782027 ]

Li Zou edited comment on CASSANDRA-5932 at 9/30/13 6:04 PM:
------------------------------------------------------------

Hello [~iamaleksey] and [~jbellis],

It appears to me that the testing results suggest that the "_data read + speculative retry_" path works as expected. This path has greatly reduced the throughput impact caused by the failure of one of the Cassandra server nodes.

The small degradation of throughput performance observed when _speculative retry_ is enabled is very likely caused by the "*_read repair_*" path. I read through the code of this path last Friday and noticed some design / coding issues that I would like to discuss with you.

Please note that my code base is still the Cassandra 2.0.0 tarball, not updated with the latest code changes.

*Issue 1* -- When handling {{DigestMismatchException}} in {{StorageProxy.fetchRows()}}, all _data read requests_ are sent out using {{sendRR()}} without distinguishing remote nodes from the local node.

Will this cause an issue? {{MessagingService.instance().sendRR()}} sends the enqueued messages to a specified remote node over its pre-established TCP socket connection; for the local node, the read should instead go through {{LocalReadRunnable}}, i.e. {{StageManager.getStage(Stage.READ).execute(new LocalReadRunnable(command, handler))}}.
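
Something along these lines is what I have in mind. This is only a sketch against my 2.0.0 reading, not a patch: {{contactedReplicas}} stands for whatever endpoint list the retry loop iterates today, and {{command}} / {{repairHandler}} are the names from the snippets quoted below.

{noformat}
// Sketch: dispatch the digest-mismatch data reads the same way the initial reads
// are dispatched, i.e. run the local read on the READ stage instead of sendRR().
for (InetAddress endpoint : contactedReplicas)
{
    if (endpoint.equals(FBUtilities.getBroadcastAddress()))
        StageManager.getStage(Stage.READ).execute(new LocalReadRunnable(command, repairHandler));
    else
        MessagingService.instance().sendRR(command.createMessage(), endpoint, repairHandler);
}
{noformat}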

If this is indeed an issue, the following wait may block.

{noformat}
            // read the results for the digest mismatch retries
            if (repairResponseHandlers != null)
            {
                for (int i = 0; i < repairCommands.size(); i++)
                {
                    ReadCommand command = repairCommands.get(i);
                    ReadCallback<ReadResponse, Row> handler = repairResponseHandlers.get(i);

                    Row row;
                    try
                    {
                        row = handler.get();
                    }
{noformat}

For two reasons:
* The data read request for the local node may never be sent out.
* One of the nodes is down (which is what triggered the speculative retry in the first place), so one response will be missing.

*If two responses are missing, this wait will block for 10 seconds*.
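
To illustrate the blocking behavior, here is a toy model only (not Cassandra code; the real wait is the {{handler.get()}} call quoted above, which uses the configured read rpc timeout):

{noformat}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class BlockForDemo
{
    public static void main(String[] args) throws InterruptedException
    {
        int blockFor = 2;                                   // e.g. QUORUM with RF = 3
        CountDownLatch responses = new CountDownLatch(blockFor);

        // Only one replica answers: the local read was never dispatched and
        // another replica is down, so the second response never arrives.
        responses.countDown();

        boolean resolved = responses.await(10, TimeUnit.SECONDS);   // ~ read rpc timeout
        System.out.println(resolved ? "resolved" : "blocked for the full 10 seconds");
    }
}
{noformat}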

*Issue 2* -- For _data repair_, {{RowDataResolver.resolve()}} has a similar issue: it calls {{scheduleRepairs()}}, which sends out messages using {{sendRR()}} without distinguishing remote nodes from the local node.
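
Again just a sketch of the idea (the variable names follow my 2.0.0 copy of {{scheduleRepairs()}} and may differ from the latest code): the repair mutation could be applied locally instead of going through {{MessagingService}} when the target replica is this node.

{noformat}
InetAddress target = endpoints.get(i);
if (target.equals(FBUtilities.getBroadcastAddress()))
    rowMutation.apply();   // local write path, no TCP round trip and no remote ack to wait for
else
    results.add(MessagingService.instance().sendRR(rowMutation.createMessage(MessagingService.Verb.READ_REPAIR),
                                                   target));
{noformat}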

*Issue 3* -- When handling _data repair_, {{StorageProxy.fetchRows()}} blocks waiting for acks to all of the _data repair_ requests sent out using {{sendRR()}}, which may stall the calling thread.

On the _data repair_ path, *data requests* are sent out, the received responses are compared / merged, the merged / diff version is sent back out, and only then does the thread block for acks.

How is the _local node_ handled here? Can {{sendRR()}} and the corresponding receive path cover the local node? If not, this may also block for 10 seconds.

{noformat}
            if (repairResponseHandlers != null)
            {
                for (int i = 0; i < repairCommands.size(); i++)
                {
                    ReadCommand command = repairCommands.get(i);
                    ReadCallback<ReadResponse, Row> handler = repairResponseHandlers.get(i);

                    Row row;
                    try
                    {
                        row = handler.get();
                    }
                    catch (DigestMismatchException e)
                    ...
                    RowDataResolver resolver = (RowDataResolver)handler.resolver;
                    try
                    {
                        // wait for the repair writes to be acknowledged, to minimize impact on any replica that's
                        // behind on writes in case the out-of-sync row is read multiple times in quick succession
                        FBUtilities.waitOnFutures(resolver.repairResults, DatabaseDescriptor.getWriteRpcTimeout());
                    }
                    catch (TimeoutException e)
                    {
                        Tracing.trace("Timed out on digest mismatch retries");
                        int blockFor = consistency_level.blockFor(Keyspace.open(command.getKeyspace()));
                        throw new ReadTimeoutException(consistency_level, blockFor, blockFor, true);
                    }
                    }
{noformat}

*Question for waiting for the ack* -- Do we really need to wait for the acks?

We could take a best-effort approach, i.e. issue the data repair writes and then return, with no need to block waiting for the acks as confirmation.
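
As a sketch of the best-effort idea against the block quoted above (not a patch), the timeout could be traced and swallowed instead of failing the whole read:

{noformat}
try
{
    // still kick off the repair writes, but don't fail the read if the acks are slow
    FBUtilities.waitOnFutures(resolver.repairResults, DatabaseDescriptor.getWriteRpcTimeout());
}
catch (TimeoutException e)
{
    // best effort: trace and return the resolved row instead of throwing ReadTimeoutException
    Tracing.trace("Timed out waiting for read repair acks; returning the resolved row anyway");
}
{noformat}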

*Question for the Randomized approach* -- Since the endpoints are randomized, the first node in the list is unlikely to be the local node. This may increase the likelihood of data repair.

In the *Randomized Approach*, the endpoints are reshuffled, so the first node in the list, which is used for the _data read request_, is unlikely to be the local node. If that node happens to be the *DOWN* node, we end up with only digest responses and no data, which will block and eventually time out.




> Speculative read performance data show unexpected results
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-5932
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5932
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Ryan McGuire
>            Assignee: Aleksey Yeschenko
>             Fix For: 2.0.2
>
>         Attachments: 5932-6692c50412ef7d.png, 5932.ded39c7e1c2fa.logs.tar.gz, 5932.txt, 5933-128_and_200rc1.png, 5933-7a87fc11.png, 5933-logs.tar.gz, 5933-randomized-dsnitch-replica.2.png, 5933-randomized-dsnitch-replica.3.png, 5933-randomized-dsnitch-replica.png, compaction-makes-slow.png, compaction-makes-slow-stats.png, eager-read-looks-promising.png, eager-read-looks-promising-stats.png, eager-read-not-consistent.png, eager-read-not-consistent-stats.png, node-down-increase-performance.png
>
>
> I've done a series of stress tests with eager retries enabled that show undesirable behavior. I'm grouping these behaviours into one ticket as they are most likely related.
> 1) Killing off a node in a 4 node cluster actually increases performance.
> 2) Compactions make nodes slow, even after the compaction is done.
> 3) Eager Reads tend to lessen the *immediate* performance impact of a node going down, but not consistently.
> My Environment:
> 1 stress machine: node0
> 4 C* nodes: node4, node5, node6, node7
> My script:
> node0 writes some data: stress -d node4 -F 30000000 -n 30000000 -i 5 -l 2 -K 20
> node0 reads some data: stress -d node4 -n 30000000 -o read -i 5 -K 20
> h3. Examples:
> h5. A node going down increases performance:
> !node-down-increase-performance.png!
> [Data for this test here|http://ryanmcguire.info/ds/graph/graph.html?stats=stats.eager_retry.node_killed.just_20.json&metric=interval_op_rate&operation=stress-read&smoothing=1]
> At 450s, I kill -9 one of the nodes. There is a brief decrease in performance as the snitch adapts, but then it recovers... to even higher performance than before.
> h5. Compactions make nodes permanently slow:
> !compaction-makes-slow.png!
> !compaction-makes-slow-stats.png!
> The green and orange lines represent trials with eager retry enabled, they never recover their op-rate from before the compaction as the red and blue lines do.
> [Data for this test here|http://ryanmcguire.info/ds/graph/graph.html?stats=stats.eager_retry.compaction.2.json&metric=interval_op_rate&operation=stress-read&smoothing=1]
> h5. Speculative Read tends to lessen the *immediate* impact:
> !eager-read-looks-promising.png!
> !eager-read-looks-promising-stats.png!
> This graph looked the most promising to me, the two trials with eager retry, the green and orange line, at 450s showed the smallest dip in performance.
> [Data for this test here|http://ryanmcguire.info/ds/graph/graph.html?stats=stats.eager_retry.node_killed.json&metric=interval_op_rate&operation=stress-read&smoothing=1]
> h5. But not always:
> !eager-read-not-consistent.png!
> !eager-read-not-consistent-stats.png!
> This is a retrial with the same settings as above, yet the 95percentile eager retry (red line) did poorly this time at 450s.
> [Data for this test here|http://ryanmcguire.info/ds/graph/graph.html?stats=stats.eager_retry.node_killed.just_20.rc1.try2.json&metric=interval_op_rate&operation=stress-read&smoothing=1]



--
This message was sent by Atlassian JIRA
(v6.1#6144)
