incubator-cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: Read-repair working, repair not working?
Date Sun, 10 Feb 2013 20:12:59 GMT
> I’d request data, nothing would be returned, I would then re-request the data and it
> would correctly be returned:
> 
What CL are you using for reads and writes?
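
(As an aside, a toy sketch of why I ask, assuming CL ONE reads: with RF 2, a write acknowledged by one replica leaves the other stale; a read served by the stale replica returns nothing, and the read repair it triggers is what makes the retry succeed. Illustrative Java only, not Cassandra code.)

    import java.util.HashMap;
    import java.util.Map;

    public class ReadRepairToy {
        public static void main(String[] args) {
            Map<String, String> replica1 = new HashMap<String, String>();
            Map<String, String> replica2 = new HashMap<String, String>();

            // The write lands on replica1 only (e.g. the mutation to replica2 was dropped).
            replica1.put("key", "value");

            // A CL ONE read served by the stale replica comes back empty.
            System.out.println("1st read: " + replica2.get("key"));   // null

            // Read repair copies the winning value across in the background.
            replica2.put("key", replica1.get("key"));

            // The retry now sees the data.
            System.out.println("2nd read: " + replica2.get("key"));   // value
        }
    }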

> I see a number of dropped ‘MUTATION’ operations: just under 5% of the total ‘MutationStage’
> count.
> 
Dropped mutations in a multi-DC setup may be a sign of network congestion or overloaded nodes.



> - Could anybody suggest anything specific to look at to see why the repair operations
>   aren’t having the desired effect?
> 
I would first build a test case to confirm correct operation when using strong consistency, i.e. QUORUM writes and reads. Because you are using RF 2 per DC, I assume you are not using LOCAL_QUORUM, since at RF 2 that requires both local replicas and leaves you with no redundancy in the DC.
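
For reference, the arithmetic behind that last point (a minimal sketch, assuming the standard floor(RF/2) + 1 quorum rule; not Cassandra code):

    // Replica-count arithmetic behind LOCAL_QUORUM at RF 2 (illustrative only).
    public class QuorumMath {
        static int quorum(int replicationFactor) {
            return replicationFactor / 2 + 1;   // floor(RF / 2) + 1
        }

        public static void main(String[] args) {
            // quorum(2) == 2: LOCAL_QUORUM needs both local replicas, so losing a
            // single node in that DC fails local-quorum reads and writes.
            System.out.println("Replicas needed for LOCAL_QUORUM at RF 2: " + quorum(2));
        }
    }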

 
> 
> - Would increasing the logging level to ‘DEBUG’ show read-repair activity (to confirm
>   that this is happening, when & for what proportion of total requests)?
It would, but the INFO logging for the AES (AntiEntropyService) is pretty good. I would hold off for now.

> 
> -          Is there something obvious that I could be missing here?
When a new AES session starts, it logs this:

    logger.info(String.format("[repair #%s] new session: will sync %s on range %s for %s.%s",
            getName(), repairedNodes(), range, tablename, Arrays.toString(cfnames)));

When it completes, it logs this:

    logger.info(String.format("[repair #%s] session completed successfully", getName()));

Or this on failure:

    logger.error(String.format("[repair #%s] session completed with the following error",
            getName()), exception);
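
So in the log (system.log by default) a healthy repair shows a “new session” line followed by a “session completed successfully” line for each session. Purely to show the shape, with made-up placeholders for the session id, endpoints, range and names:

    [repair #<session id>] new session: will sync <endpoints> on range <range> for MyKeyspace.[cf1, cf2]
    [repair #<session id>] session completed successfully

If a session failed you would see the error line (and the exception) instead, so grepping for “repair #” is a quick way to check that every session that started also completed.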


Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 10/02/2013, at 9:56 PM, Brian Fleming <bigbrianfleming@gmail.com> wrote:

> Hi,
> 
> I have a 20-node cluster running v1.0.7, split between 5 data centres each with an RF
> of 2, containing a ~1TB unique dataset (~10TB of total data).
> 
> I’ve had some intermittent issues with data consistency & availability in a new data
> centre (3 nodes, RF=2) that I brought online late last year: I’d request data, nothing
> would be returned, I would then re-request the data and it would correctly be returned:
> i.e. read-repair appeared to be occurring.  However, running repairs on the nodes (both
> general ‘repair’ commands and targeted keyspace commands) didn’t alter the behaviour.
> 
> After a lot of fruitless investigation, I decided to wipe & re-install/re-populate
> the nodes.  The re-install & repair operations are now complete: I see the expected
> amount of data on the nodes, however I am still seeing the same behaviour, i.e. I only
> get data after one failed attempt.
> 
> When I run repair commands, I don’t see any errors in the logs.
> 
> I see the expected ‘AntiEntropySessions’ count in ‘nodetool tpstats’ during repair
> sessions.
> 
> I see a number of dropped ‘MUTATION’ operations: just under 5% of the total ‘MutationStage’
> count.
> 
> Questions:
> 
> - Could anybody suggest anything specific to look at to see why the repair operations
>   aren’t having the desired effect?
> 
> - Would increasing the logging level to ‘DEBUG’ show read-repair activity (to confirm
>   that this is happening, when & for what proportion of total requests)?
> 
> - Is there something obvious that I could be missing here?
> 
> Many thanks,
> 
> Brian

