cassandra-commits mailing list archives

From "Shivaram Venkataraman (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-4261) [patch] Support consistency-latency prediction in nodetool
Date Tue, 10 Jul 2012 17:26:35 GMT


Shivaram Venkataraman commented on CASSANDRA-4261:

We've posted an updated patch:

1.) {{PBSPredictor}} is now tunable via JMX. Latency collection can be enabled/disabled by
calling {{enableConsistencyPredictionLogging()}}/{{disableConsistencyPredictionLogging()}},
respectively, and the number of latencies collected can be tuned by calling {{setNumberTrialsForConsistencyPrediction()}} (see the sketch after this list).

2.) We performed benchmarking and optimized our logging to minimize overhead.
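
As a concrete illustration of the JMX hooks mentioned in 1.), here is a minimal client sketch. The MBean object name ({{org.apache.cassandra.service:type=PBSPredictor}}) and the {{int}} signature for the trial count are assumptions for illustration, not taken from the patch:

{code:title=JMX example (illustrative)|borderStyle=solid}
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class PBSPredictorJmxExample
{
    public static void main(String[] args) throws Exception
    {
        // Connect to Cassandra's JMX port (7199 by default).
        JMXServiceURL url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try
        {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Hypothetical MBean name; substitute the name the patch actually registers.
            ObjectName pbs = new ObjectName("org.apache.cassandra.service:type=PBSPredictor");

            // Turn latency collection on.
            mbs.invoke(pbs, "enableConsistencyPredictionLogging", new Object[0], new String[0]);

            // Tune the number of Monte Carlo trials used for prediction (signature assumed to be int).
            mbs.invoke(pbs, "setNumberTrialsForConsistencyPrediction",
                       new Object[]{ 20000 }, new String[]{ "int" });

            // Turn collection off again when done.
            mbs.invoke(pbs, "disableConsistencyPredictionLogging", new Object[0], new String[0]);
        }
        finally
        {
            connector.close();
        }
    }
}
{code}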

We ran load tests by sending 1M queries using cassandra-stress on EC2 with 4 m1.large instances
with ephemeral storage formatted as XFS and a replication factor of three. We've posted more
details and the scripts used for benchmarking at

We compared three setups:
1.) "trunk": without this patch
2.) "no-pbs": patch applied but consistency prediction logging disabled 
3.) "pbs": patch applied and logging enabled

We tested with ConsistencyLevel=ONE (R1W1) and ConsistencyLevel=QUORUM (R2W2). The average
latency (ms) and standard deviation across five trials are below:

R1W1 - Insert:
||setup||avg latency (ms)||std dev||
|trunk | 10.31 | 0.090 |
|no-pbs| 10.58 | 0.092 |
|pbs   | 11.21 | 0.107 |

R1W1 - Read:
||setup||avg latency (ms)||std dev||
|trunk  | 9.11 | 0.067 |
|no-pbs | 9.13 | 0.044 |
|pbs    | 9.27 | 0.015 |

R2W2 - Insert:
||setup||avg latency (ms)||std dev||
|trunk  | 12.36 | 0.028 |
|no-pbs | 12.44 | 0.072 |
|pbs    | 13.21 | 0.068 |

R2W2 - Read:
||setup||avg latency (ms)||std dev||
|trunk  | 12.41 | 0.136 |
|no-pbs | 12.56 | 0.054 |
|pbs    | 12.79 | 0.099 |

The latency overhead for inserts is around 0.9ms when PBS is turned on, for both R1W1 and R2W2.
We believe this is primarily due to the overhead of calling System.currentTimeMillis() at the
start and finish of each message, and also due to 50 stress threads inserting latency
information into the ConcurrentHashMap.

The overhead is around 0.2ms per query when PBS logging is turned off (at most 1.65% overhead).
This is because, even though logging is turned off, the creation time of each message is still
serialized in {{net.MessageOut}} and {{net.MessageIn}}. We can optimize this by adding an extra
flag to the wire protocol and sending the timestamp only when that flag is set (also configurable
via JMX), if you prefer.
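
To make that proposal concrete, here is a minimal sketch of what the optional timestamp could look like at the serialization layer; the class, method, and parameter names are illustrative, not the patch's actual identifiers:

{code:title=Optional timestamp on the wire (sketch)|borderStyle=solid}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class OptionalTimestampSerialization
{
    // Write a 1-byte flag, and the 8-byte millisecond timestamp only when tracing is enabled.
    public static void serialize(DataOutput out, boolean sendTimestamp, long creationTimeMs) throws IOException
    {
        out.writeBoolean(sendTimestamp);
        if (sendTimestamp)
            out.writeLong(creationTimeMs);
    }

    // Returns -1 when the sender did not include a timestamp.
    public static long deserialize(DataInput in) throws IOException
    {
        return in.readBoolean() ? in.readLong() : -1L;
    }
}
{code}

This would keep the per-message overhead at one byte when prediction logging is disabled, at the cost of a wire-format change.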

Finally, the overhead for reads is lower than that for writes because reads are only sent
to the nearest nodes and sending local messages avoids the PBS latency collection code path.
> [patch] Support consistency-latency prediction in nodetool
> ----------------------------------------------------------
>                 Key: CASSANDRA-4261
>                 URL:
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Tools
>    Affects Versions: 1.2
>            Reporter: Peter Bailis
>         Attachments: pbs-nodetool-v2.patch, pbs-nodetool-v3.patch
> h3. Introduction
> Cassandra supports a variety of replication configurations: ReplicationFactor is set
per-ColumnFamily and ConsistencyLevel is set per-request. Setting {{ConsistencyLevel}} to
{{QUORUM}} for reads and writes ensures strong consistency, but {{QUORUM}} is often slower
than {{ONE}}, {{TWO}}, or {{THREE}}. What should users choose?
> This patch provides a latency-consistency analysis within {{nodetool}}. Users can accurately
predict Cassandra's behavior in their production environments without interfering with performance.
> What's the probability that we'll read a write t seconds after it completes? What about
reading one of the last k writes? This patch provides answers via {{nodetool predictconsistency}}:
> {{nodetool predictconsistency ReplicationFactor TimeAfterWrite Versions}}
> \\ \\
> {code:title=Example output|borderStyle=solid}
> //N == ReplicationFactor
> //R == read ConsistencyLevel
> //W == write ConsistencyLevel
> user@test:$ nodetool predictconsistency 3 100 1
> Performing consistency prediction
> 100ms after a given write, with maximum version staleness of k=1
> N=3, R=1, W=1
> Probability of consistent reads: 0.678900
> Average read latency: 5.377900ms (99.900th %ile 40ms)
> Average write latency: 36.971298ms (99.900th %ile 294ms)
> N=3, R=1, W=2
> Probability of consistent reads: 0.791600
> Average read latency: 5.372500ms (99.900th %ile 39ms)
> Average write latency: 303.630890ms (99.900th %ile 357ms)
> N=3, R=1, W=3
> Probability of consistent reads: 1.000000
> Average read latency: 5.426600ms (99.900th %ile 42ms)
> Average write latency: 1382.650879ms (99.900th %ile 629ms)
> N=3, R=2, W=1
> Probability of consistent reads: 0.915800
> Average read latency: 11.091000ms (99.900th %ile 348ms)
> Average write latency: 42.663101ms (99.900th %ile 284ms)
> N=3, R=2, W=2
> Probability of consistent reads: 1.000000
> Average read latency: 10.606800ms (99.900th %ile 263ms)
> Average write latency: 310.117615ms (99.900th %ile 335ms)
> N=3, R=3, W=1
> Probability of consistent reads: 1.000000
> Average read latency: 52.657501ms (99.900th %ile 565ms)
> Average write latency: 39.949799ms (99.900th %ile 237ms)
> {code}
> h3. Demo
> Here's an example scenario you can run using [ccm|]. The
prediction is fast:
> {code:borderStyle=solid}
> cd <cassandra-source-dir with patch applied>
> ant
> # turn on consistency logging
> sed -i .bak 's/log_latencies_for_consistency_prediction: false/log_latencies_for_consistency_prediction: true/' conf/cassandra.yaml
> ccm create consistencytest --cassandra-dir=. 
> ccm populate -n 5
> ccm start
> # if start fails, you might need to initialize more loopback interfaces
> # e.g., sudo ifconfig lo0 alias
> # use stress to get some sample latency data
> tools/bin/stress -d -l 3 -n 10000 -o insert
> tools/bin/stress -d -l 3 -n 10000 -o read
> bin/nodetool -h -p 7100 predictconsistency 3 100 1
> {code}
> h3. What and Why
> We've implemented [Probabilistically Bounded Staleness|],
a new technique for predicting consistency-latency trade-offs within Cassandra. Our [paper|]
will appear in [VLDB 2012|], and, in it, we've used PBS to profile
a range of Dynamo-style data store deployments at places like LinkedIn and Yammer in addition
to profiling our own Cassandra deployments. In our experience, prediction is both accurate
and much more lightweight than profiling and manually testing each possible replication configuration
(especially in production!).
> This analysis is important for the many users we've talked to and heard about who use
"partial quorum" operation (e.g., non-{{QUORUM}} {{ConsistencyLevel}}). Should they use CL={{ONE}}?
CL={{TWO}}? It likely depends on their runtime environment and, short of profiling in production,
there's no existing way to answer these questions. (Keep in mind, Cassandra defaults to CL={{ONE}},
meaning users don't know how stale their data will be.)
> We outline limitations of the current approach after describing how it's done. We believe
that this is a useful feature that can provide guidance and fairly accurate estimation for
most users.
> h3. Interface
> This patch allows users to perform this prediction in production using {{nodetool}}.
> Users enable tracing of latency data by setting {{log_latencies_for_consistency_prediction:
true}} in {{cassandra.yaml}}.
> Cassandra logs {{max_logged_latencies_for_consistency_prediction}} latencies. Each latency
is 8 bytes, and there are 4 distributions we require, so the space overhead is {{32*logged_latencies}}
bytes of memory for the predicting node.
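> For example (an illustrative setting, not necessarily the default): with {{max_logged_latencies_for_consistency_prediction}} set to 10,000, the predicting node holds 4 distributions × 8 bytes × 10,000 latencies = 320,000 bytes, roughly 312 KB.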
> {{nodetool predictconsistency}} predicts the latency and consistency for each possible
{{ConsistencyLevel}} setting (reads and writes) by running {{number_trials_for_consistency_prediction}}
Monte Carlo trials per configuration.
> Users shouldn't have to touch these parameters, and the defaults work well. The more
latencies they log, the better the predictions will be.
> h3. Implementation
> This patch is fairly lightweight, requiring minimal changes to existing code. The high-level
overview is that we gather empirical latency distributions then perform Monte Carlo analysis
using the gathered data.
> h4. Latency Data
> We log latency data in {{service.PBSPredictor}}, recording four relevant distributions:
>  * *W*: time from when the coordinator sends a mutation to the time that a replica begins
to serve the new value(s)
>  * *A*: time from when a replica accepting a mutation sends an acknowledgement to the time when the coordinator receives it
>  * *R*: time from when the coordinator sends a read request to the time that the replica
performs the read
> * *S*: time from when the replica sends a read response to the time when the coordinator
receives it
> We augment {{net.MessageIn}} and {{net.MessageOut}} to store timestamps along with every
message (8 bytes overhead required for millisecond {{long}}). In {{net.MessagingService}},
we log the start of every mutation and read, and, in {{net.ResponseVerbHandler}}, we log the
end of every mutation and read. Jonathan Ellis mentioned that [1123|]
had similar latency tracing, but, as far as we can tell, these latencies aren't in that patch.
We use an LRU policy to bound the number of latencies we track for each distribution.
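> As a rough sketch of the bookkeeping this implies (class and method names here are illustrative, not the patch's actual code), each of the four distributions can be kept in a bounded buffer that evicts the oldest sample once the configured cap is reached:
> {code:title=Bounded latency log (sketch)|borderStyle=solid}
> import java.util.ArrayDeque;
> import java.util.Deque;
> 
> // One of these per distribution (W, A, R, S); capacity bounded per
> // max_logged_latencies_for_consistency_prediction.
> public class BoundedLatencyLog
> {
>     private final int maxSize;
>     private final Deque<Long> latencies = new ArrayDeque<Long>();
> 
>     public BoundedLatencyLog(int maxSize)
>     {
>         this.maxSize = maxSize;
>     }
> 
>     // Record one latency in milliseconds; once full, drop the oldest sample.
>     public synchronized void add(long latencyMs)
>     {
>         if (latencies.size() == maxSize)
>             latencies.removeFirst();
>         latencies.addLast(latencyMs);
>     }
> 
>     // Copy out the samples for the Monte Carlo analysis.
>     public synchronized long[] snapshot()
>     {
>         long[] out = new long[latencies.size()];
>         int i = 0;
>         for (long l : latencies)
>             out[i++] = l;
>         return out;
>     }
> }
> {code}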
> h4. Prediction
> When prompted by {{nodetool}}, we call {{service.PBSPredictor.doPrediction}}, which performs
the actual Monte Carlo analysis based on the provided data. It's straightforward, and we've
commented this analysis pretty well but can elaborate more here if required.
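> For readers who want the gist without reading the patch, a single trial works roughly as follows (a simplified sketch of the underlying PBS model with illustrative names, not the actual {{doPrediction}} code): sample per-replica write (W), acknowledgement (A), read (R), and response (S) latencies from the logged distributions, compute when the write returns under the write {{ConsistencyLevel}}, then check whether any replica that answers the subsequent read had already applied the write. The fraction of consistent trials estimates the probability of a consistent read.
> {code:title=One Monte Carlo trial (simplified sketch)|borderStyle=solid}
> import java.util.Arrays;
> import java.util.Random;
> 
> public class PBSTrialSketch
> {
>     private final Random rng = new Random();
> 
>     private long sample(long[] latencies)
>     {
>         return latencies[rng.nextInt(latencies.length)];
>     }
> 
>     // Returns true if a read issued tAfterWrite ms after the write returns sees the new value.
>     public boolean trialIsConsistent(int n, int r, int w, long tAfterWrite,
>                                      long[] wLat, long[] aLat, long[] rLat, long[] sLat)
>     {
>         // Write: replica i applies the mutation at applyAt[i]; its ack arrives at ackAt[i].
>         long[] applyAt = new long[n];
>         long[] ackAt = new long[n];
>         for (int i = 0; i < n; i++)
>         {
>             applyAt[i] = sample(wLat);
>             ackAt[i] = applyAt[i] + sample(aLat);
>         }
>         long[] sortedAcks = ackAt.clone();
>         Arrays.sort(sortedAcks);
>         long writeReturns = sortedAcks[w - 1]; // write CL satisfied by the w-th ack
> 
>         // Read issued tAfterWrite ms later: replica i sees it at readStart + readLat[i],
>         // and its response reaches the coordinator at respAt[i].
>         long readStart = writeReturns + tAfterWrite;
>         long[] readLat = new long[n];
>         long[] respAt = new long[n];
>         for (int i = 0; i < n; i++)
>         {
>             readLat[i] = sample(rLat);
>             respAt[i] = readStart + readLat[i] + sample(sLat);
>         }
>         long[] sortedResp = respAt.clone();
>         Arrays.sort(sortedResp);
>         long cutoff = sortedResp[r - 1]; // the r fastest responses satisfy the read CL
> 
>         // Consistent if any replica counted toward the read CL had applied
>         // the write before the read request reached it.
>         for (int i = 0; i < n; i++)
>             if (respAt[i] <= cutoff && applyAt[i] <= readStart + readLat[i])
>                 return true;
>         return false;
>     }
> }
> {code}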
> h4. Testing
> We've modified the unit test for {{SerializationsTest}} and provided a new unit test
for {{PBSPredictor}} ({{PBSPredictorTest}}). You can run the {{PBSPredictor}} test with {{ant
> h4. Overhead
> This patch introduces 8 bytes of overhead per message. We could reduce this overhead
and add timestamps on-demand, but this would require changing {{net.MessageIn}} and {{net.MessageOut}}
serialization at runtime, which is messy.
> If enabled, consistency tracing requires {{32*logged_latencies}} bytes of memory on the
node on which tracing is enabled.
> h3. Caveats
>  The predictions are conservative, or worst-case, meaning we may predict more staleness
than in practice in the following ways:
>  * We do not account for read repair. 
>  * We do not account for Merkle tree exchange.
>  * Multi-version staleness is particularly conservative.
>  * We simulate non-local reads and writes. We assume that the coordinating Cassandra
node is not itself a replica for a given key.
> The predictions are optimistic in the following ways:
>  * We do not predict the impact of node failure.
>  * We do not model hinted handoff.
> Predictions are only as good as the collected latencies. Generally, the more latencies
that are collected, the better, but if the environment or workload changes, the collected data
(and therefore the predictions) may no longer be representative. Also, we currently don't
distinguish between column families or value sizes. This is doable, but it adds complexity to
the interface and possibly more storage overhead.
> Finally, for accurate results, we require replicas to have synchronized clocks (Cassandra
requires this from clients anyway). If clocks are skewed/out of sync, this will bias predictions
by the magnitude of the skew.
> We can potentially improve these if there's interest, but this is an area of active research.
> ----
> Peter Bailis and Shivaram Venkataraman
> [|]
> [|]


