cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jeremy Hanna <>
Subject Re: 4/20 nodes get disproportionate amount of mutations
Date Thu, 25 Aug 2011 08:53:36 GMT
As somewhat of a conclusion to this thread, we have resolved the major issue having to do with
the hotspots.  We were balanced between availability zones in aws/ec2 us-east - a,b,c with
the number of nodes in our cluster.  However we didn't alternate by rack with the token order.
 We are using the property file snitch and had defined each node as being part of a single
DC (for now) and a rack (a-c).  But we should have made the tokens in order go from a to b
to c with their rack.

So as an example of alternating:
<token1>: <node in rack b>
<token2>: <node in rack a>
<token3>: <node in rack c>
<token4>: <node in rack b>

So we took some time and shifted things around and things appear to be much more spread out
across the cluster.

As a side note: how we discovered the root of the problem.  We kept poring over logs and stats
but it helped to trial DataStax's OpsCenter product to get a view of the cluster.  All that
info is available via jmx and we could have mapped out node replication ourselves or with
another monitoring product.  However that's what helped us.  That along with Brandon and others
from the community helped us discover the reason.  Thanks again.

The reason why the order matters - currently when replicating I believe NetworkTopologyStrategy
uses a pattern of choosing local for the first replica, in-rack for second replica, and off-rack
for the next replica (depending on replication factor).  It appears though that when choosing
the non-local replicas, it looks for the next token in the ring of the same rack and the next
token of a different rack (depending on which it is looking for).  So that is why alternating
by rack is important.  That might be able to be smarter in the future which would be nice
- to not have to care and let Cassandra spread the replication around intelligently.

On Aug 23, 2011, at 6:02 AM, Jeremy Hanna wrote:

> On Aug 23, 2011, at 3:43 AM, aaron morton wrote:
>> Dropped messages in ReadRepair is odd. Are you also dropping mutations ? 
>> There are two tasks performed on the ReadRepair stage. The digests are compared on
this stage, and secondly the repair happens on the stage. Comparing digests is quick. Doing
the repair could take a bit longer, all the cf's returned are collated, filtered and deletes
>> We don't do background Read Repair on range scans, they do have foreground digest
checking though.
>> What CL are you using ? 
> CL.ONE for hadoop writes, CL.QUORUM for hadoop reads
>> begin crazy theory:
>> 	Could there be a very big row that is out of sync ? The increased RR would be resulting
in mutations been sent back to the replicas. Which would give you a hot spot in mutations.
>> 	Check max compacted row size on the hot nodes. 
>> 	Turn the logging up to DEBUG on the hot machines for o.a.c.service.RowRepairResolver
and look for the "resolve:…" message it has the time taken.
> The max compacted size didn't seem unreasonable - about a MB.  I turned up logging to
DEBUG for that class and I get plenty of dropped READ_REPAIR messages, but nothing coming
out of DEBUG in the logs to indicate the time taken that I can see.
>> Cheers
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> On 23/08/2011, at 7:52 PM, Jeremy Hanna wrote:
>>> On Aug 23, 2011, at 2:25 AM, Peter Schuller wrote:
>>>>> We've been having issues where as soon as we start doing heavy writes
(via hadoop) recently, it really hammers 4 nodes out of 20.  We're using random partitioner
and we've set the initial tokens for our 20 nodes according to the general spacing formula,
except for a few token offsets as we've replaced dead nodes.
>>>> Is the hadoop job iterating over keys in the cluster in token order
>>>> perhaps, and you're generating writes to those keys? That would
>>>> explain a "moving hotspot" along the cluster.
>>> Yes - we're iterating over all the keys of particular column families, doing
joins using pig as we enrich and perform measure calculations.  When we write, we're usually
writing out for a certain small subset of keys which shouldn't have hotspots with RandomPartitioner
>>>> -- 
>>>> / Peter Schuller (@scode on twitter)

View raw message