This has happened before the save caches files were not compatible between 0.6 and 0.7. I have ran into this a couple other times before. The good news is the save key cache is just an optimization, you can blow it away and it is not usually a big deal.




On Fri, Apr 5, 2013 at 2:55 PM, Arya Goudarzi <goudarzi@gmail.com> wrote:
Here is a chunk of bloom filter sstable skip messages from the node I enabled DEBUG on:

DEBUG [OptionalTasks:1] 2013-04-04 02:44:01,450 SSTableReader.java (line 737) Bloom filter allows skipping sstable 39459
DEBUG [OptionalTasks:1] 2013-04-04 02:44:01,450 SSTableReader.java (line 737) Bloom filter allows skipping sstable 39483
DEBUG [OptionalTasks:1] 2013-04-04 02:44:01,450 SSTableReader.java (line 737) Bloom filter allows skipping sstable 39332
DEBUG [OptionalTasks:1] 2013-04-04 02:44:01,450 SSTableReader.java (line 737) Bloom filter allows skipping sstable 39335
DEBUG [OptionalTasks:1] 2013-04-04 02:44:01,450 SSTableReader.java (line 737) Bloom filter allows skipping sstable 39438
DEBUG [OptionalTasks:1] 2013-04-04 02:44:01,450 SSTableReader.java (line 737) Bloom filter allows skipping sstable 39478
DEBUG [OptionalTasks:1] 2013-04-04 02:44:01,450 SSTableReader.java (line 737) Bloom filter allows skipping sstable 39456
DEBUG [OptionalTasks:1] 2013-04-04 02:44:01,450 SSTableReader.java (line 737) Bloom filter allows skipping sstable 39469
DEBUG [OptionalTasks:1] 2013-04-04 02:44:01,451 SSTableReader.java (line 737) Bloom filter allows skipping sstable 39334
DEBUG [OptionalTasks:1] 2013-04-04 02:44:01,451 SSTableReader.java (line 737) Bloom filter allows skipping sstable 39406

This is the last chunk of log before C* gets stuck, right before I stop the process, remove key caches and start again (This is from another node that I upgraded 2 days ago):

INFO [SSTableBatchOpen:2] 2013-04-03 01:59:39,769 SSTableReader.java (line 166) Opening /var/lib/cassandra/data/cardspring_production/UniqueIndexes/cardspring_production-UniqueIndexes-hf-316499 (5273270 bytes)
 INFO [SSTableBatchOpen:1] 2013-04-03 01:59:39,858 SSTableReader.java (line 166) Opening /var/lib/cassandra/data/cardspring_production/UniqueIndexes/cardspring_production-UniqueIndexes-hf-314755 (5264359 bytes)
 INFO [SSTableBatchOpen:2] 2013-04-03 01:59:39,894 SSTableReader.java (line 166) Opening /var/lib/cassandra/data/cardspring_production/UniqueIndexes/cardspring_production-UniqueIndexes-hf-314762 (5260887 bytes)
 INFO [SSTableBatchOpen:1] 2013-04-03 01:59:39,980 SSTableReader.java (line 166) Opening /var/lib/cassandra/data/cardspring_production/UniqueIndexes/cardspring_production-UniqueIndexes-hf-315886 (5262864 bytes)
 INFO [OptionalTasks:1] 2013-04-03 01:59:40,298 AutoSavingCache.java (line 112) reading saved cache /var/lib/cassandra/saved_caches/cardspring_production-UniqueIndexes-KeyCache


I finally upgrade all 12 nodes in our test environment yesterday. This issue seemed to exists on 7 nodes out of 12 nodes. They didn't alway get stuck on the same CF loading its saved KeyCache.


On Fri, Apr 5, 2013 at 9:56 AM, aaron morton <aaron@thelastpickle.com> wrote:
skipping sstable due to bloom filter debug messages
What were these messages?

Do you have the logs from the start up ? 

Cheers

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton

On 4/04/2013, at 6:11 AM, Arya Goudarzi <goudarzi@gmail.com> wrote:

Hi,

I have upgraded 2 nodes out of a 12 mode test cluster from 1.1.10 to 1.2.3. During startup while tailing C*'s system.log, I observed a series of SSTable batch load messages and skipping sstable due to bloom filter debug messages which is normal for startup, but when it reached loading saved key caches, it gets stuck forever. The I/O wait stays high in the CPU graph and I/O ops are sent to disk, but C* never passes that step of loading the key cache file successfully. The saved key cache file was about 75MB on one node and 125MB on the other node and they were for different CFs. 

<image.jpeg>

The CPU I/O wait constantly stayed at 40%~ while system.log was stuck at loading one saved key cache file. I have marked that on the graph above. The workaround was to delete the saved cache files and things loaded fine (See marked Normal Startup). 

These machines are m1.xlarge EC2 instances. And this issue happened on both nodes upgraded. This did not happen during exercise of upgrade from 1.1.6 to 1.2.2 using the same snapshot. 

Should I raise a JIRA? 

-Arya