incubator-cassandra-user mailing list archives

From: Aaron Morton <aa...@thelastpickle.com>
Subject: Re: Serious issue updating Cassandra version and topology
Date: Tue, 10 Jul 2012 09:05:18 GMT
To be clear, this happened on a 1.1.2 node and it happened again *after* you had run a scrub?

Has this cluster been around for a while or was the data created with 1.1 ?

Can you confirm that all sstables were re-written for the CF? Check the timestamp on the files.
Also, all files should have the same version, the -h?- part of the name.
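As a quick check, something like the following will show both; the path assumes the default data_file_directories setting and the keyspace/CF names are placeholders, so adjust for your install:

  ls -l /var/lib/cassandra/data/<keyspace>/<cf>/*-Data.db

If scrub rewrote everything you should see fresh modification times, and every -Data.db name should carry the same -h?- version tag.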

Can you repair the other CFs?
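(If it helps to narrow things down, repair can be pointed at a single keyspace and column family; the host and names below are placeholders:)

  nodetool -h <host> repair <keyspace> <cf>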

If this cannot be repaired by scrub or upgradesstables you may need to cut the row out of the
sstables using sstable2json and json2sstable.
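Very roughly, and only as a sketch (run it against a copy of the sstable with the node stopped, and check each tool's usage output for your version, the flags below are from memory): dump the sstable while excluding the bad row key, then rebuild a new sstable from the JSON:

  sstable2json <cf>-h?-NNN-Data.db -x <bad_row_key_in_hex> > <cf>-fixed.json
  json2sstable -K <keyspace> -c <cf> <cf>-fixed.json <new-sstable>-Data.db

<keyspace>, <cf>, NNN, the key and the output file name above are all placeholders.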

 
Cheers
 
-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 8/07/2012, at 4:05 PM, Michael Theroux wrote:

> Hello,
> 
> We're in the process of trying to move a 6-node cluster from RF=1 to RF=3. Once our replication
> factor was upped to 3, we ran nodetool repair, and immediately hit an issue on the first node
> we ran repair on:
> 
>  INFO 03:08:51,536 Starting repair command #1, repairing 2 ranges.
>  INFO 03:08:51,552 [repair #3e724fe0-c8aa-11e1-0000-4f728ab9d6ff] new session: will sync
> xxx-xx-xx-xxx-132.compute-1.amazonaws.com/10.202.99.101, /10.29.187.61 on range (Token(bytes[d5555555555555555555555555555558]),Token(bytes[00000000000000000000000000000000])]
> for xxxxx.[aaaaa, bbbbb, ccccc, ddddd, eeeee, fffff, ggggg, hhhhh, iiiii, jjjjj, kkkkk, lllll,
> mmmmm, nnnnn, ooooo, ppppp, qqqqq, rrrrr, sssss]
>  INFO 03:08:51,555 [repair #3e724fe0-c8aa-11e1-0000-4f728ab9d6ff] requesting merkle trees
> for aaaaa (to [/10.29.187.61, xxx-xx-xx-xxx-compute-1.amazonaws.com/10.202.99.101])
>  INFO 03:08:52,719 [repair #3e724fe0-c8aa-11e1-0000-4f728ab9d6ff] Received merkle tree
> for aaaaa from /10.29.187.61
>  INFO 03:08:53,518 [repair #3e724fe0-c8aa-11e1-0000-4f728ab9d6ff] Received merkle tree
> for aaaaa from xxx-xx-xx-xxx-.compute-1.amazonaws.com/10.202.99.101
>  INFO 03:08:53,519 [repair #3e724fe0-c8aa-11e1-0000-4f728ab9d6ff] requesting merkle trees
> for bbbbb (to [/10.29.187.61, xxx-xx-xx-xxx-132.compute-1.amazonaws.com/10.202.99.101])
>  INFO 03:08:53,639 [repair #3e724fe0-c8aa-11e1-0000-4f728ab9d6ff] Endpoints /10.29.187.61
> and xxx-xx-xx-xxx-132.compute-1.amazonaws.com/10.202.99.101 are consistent for aaaaa
>  INFO 03:08:53,640 [repair #3e724fe0-c8aa-11e1-0000-4f728ab9d6ff] aaaaa is fully synced
> (18 remaining column families to sync for this session)
>  INFO 03:08:54,049 [repair #3e724fe0-c8aa-11e1-0000-4f728ab9d6ff] Received merkle tree
> for bbbbb from /10.29.187.61
> ERROR 03:09:09,440 Exception in thread Thread[ValidationExecutor:1,1,main]
> java.lang.AssertionError: row DecoratedKey(Token(bytes[efd5654ce92a705b14244e2f5f73ab98c3de2f66c7adbd71e0e893997e198c47]),
> efd5654ce92a705b14244e2f5f73ab98c3de2f66c7adbd71e0e893997e198c47) received out of order wrt
> DecoratedKey(Token(bytes[f33a5ad4a45e8cac7987737db246ddfe9294c95bea40f411485055f5dbecbadb]),
> f33a5ad4a45e8cac7987737db246ddfe9294c95bea40f411485055f5dbecbadb)
> 	at org.apache.cassandra.service.AntiEntropyService$Validator.add(AntiEntropyService.java:349)
> 	at org.apache.cassandra.db.compaction.CompactionManager.doValidationCompaction(CompactionManager.java:712)
> 	at org.apache.cassandra.db.compaction.CompactionManager.access$600(CompactionManager.java:68)
> 	at org.apache.cassandra.db.compaction.CompactionManager$8.call(CompactionManager.java:438)
> 	at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
> 	at java.util.concurrent.FutureTask.run(Unknown Source)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> 	at java.lang.Thread.run(Unknown Source)
> 
> It looks from the log above that the sync of the "aaaaa" column family was successful.  However,
> the "bbbbb" column family resulted in this error.  In addition, the repair hung after this
> error.  We ran nodetool scrub on all nodes, invalidated the key and row caches, and tried
> again (with RF=2), and it didn't help alleviate the problem.
> 
> Some other important pieces of information:
> We use ByteOrderedPartitioner (we MD5 hash the keys ourselves)
> We're using Leveled Compaction
> As we're in the middle of a transition, one node is on 1.1.2 (the one we tried repair
> on), and the other 5 are on 1.1.1
> 
> Thanks,
> -Mike
> 

