rcoli helped me investigate this issue. The mystery turned out to be that the commit log segment was probably never fsynced to disk, since the sync setting was periodic with a 10-second delay. On restart, CRC32 checksum validation failed and the replay was skipped, which explains what happened in my scenario. I am going to change our settings to batch mode.
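To illustrate the failure mode described above, here is a minimal sketch of checksum-guarded log replay. It assumes a toy record layout (length, CRC32, payload) that is not Cassandra's actual commit log format, and the names `encode_record` and `replay` are hypothetical. It only shows why a record that was not fully synced before a restart would fail validation and be left out of the replay.

```python
import struct
import zlib
from typing import List

# Toy record header: payload length and CRC32 of the payload.
# This layout is an assumption for illustration, not Cassandra's on-disk format.
HEADER = struct.Struct(">II")


def encode_record(payload: bytes) -> bytes:
    """Prefix a payload with its length and CRC32 checksum."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def replay(segment: bytes) -> List[bytes]:
    """Return payloads up to the first truncated or corrupt record."""
    replayed = []
    offset = 0
    while offset + HEADER.size <= len(segment):
        length, expected_crc = HEADER.unpack_from(segment, offset)
        start = offset + HEADER.size
        payload = segment[start:start + length]
        # A short read or a checksum mismatch means the record never made it
        # to disk intact, so replay stops here and the write is lost.
        if len(payload) < length or zlib.crc32(payload) != expected_crc:
            break
        replayed.append(payload)
        offset = start + length
    return replayed


# Two fully written records, then a third that was cut off before the sync.
good = encode_record(b"row-1") + encode_record(b"row-2")
torn = good + encode_record(b"row-3")[:-2]
```

In this sketch, `replay(torn)` returns only the first two payloads: the client may already have been told the third write succeeded, but it is not recovered after the restart.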
I was restarting Cassandra nodes again today. One hour later my support team let me know that a customer had reported some missing data. I suppose this is the same issue. The application logs show that our client got success from Thrift and proceeded to respond to the user, and I could grep the commit log for the missing record like I did before. We have durable writes enabled. To me, it seems that when data is in memtables and hasn't been flushed to disk and I restart the node, the commit log doesn't get replayed correctly.
Please advise.

On Thu, Sep 27, 2012 at 2:43 PM, Arya Goudarzi <firstname.lastname@example.org> wrote:
Thanks for your reply. I ran grep on the commit logs for the offending key, and grep reported "Binary file matches". I am trying to use this tool to extract the commitlog and actually confirm if the mutation was a write:
On Thu, Sep 27, 2012 at 1:45 AM, Sylvain Lebresne <email@example.com> wrote:
> I can verify the existence of the key that was inserted in Commitlogs of both replicas however it seems that this record was never inserted.

Out of curiosity, how can you verify that?