cassandra-user mailing list archives

From Ben Slater <ben.sla...@instaclustr.com>
Subject Re: Does recovery continue after truncating a table?
Date Sat, 26 Nov 2016 06:13:00 GMT
Nice detective work! It seems to me that it's at best an undocumented
limitation, and it could potentially be viewed as a bug - maybe log another
JIRA?

One note - there is a nodetool truncatehints command that could be used to
clear out the hints
(http://cassandra.apache.org/doc/latest/tools/nodetool/truncatehints.html?highlight=truncate).
However, it seems to clear all hints for a particular endpoint, not just
those for a specific table.
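
For example (just a sketch, using node3's address from your log - adjust to
your environment), running this on each live node before restarting node3
should drop any hints queued for it, so they can't be replayed after the
truncate:

    # clears ALL hints stored for that endpoint, not only testdb.testtbl
    nodetool truncatehints 10.91.145.27

    # alternatively, hinted handoff could be paused around the whole test
    nodetool disablehandoff
    # ... kill node3 / insert / restart / truncate ...
    nodetool enablehandoff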

Cheers
Ben

On Fri, 25 Nov 2016 at 17:42 Yuji Ito <yuji@imagine-orb.com> wrote:

> Hi all,
>
> I revised the script to reproduce the issue.
> I think the issue happens more frequently than before.
> Killing another node is added to the previous script.
>
> ==== [script] ====
> #!/bin/sh
>
> node1_ip=<node1 IP address>
> node2_ip=<node2 IP address>
> node3_ip=<node3 IP address>
> node2_user=<user name>
> node3_user=<user name>
> rows=10000
>
> echo "consistency quorum;" > init_data.cql
> for key in $(seq 0 $(expr $rows - 1))
> do
>     echo "insert into testdb.testtbl (key, val) values($key, 1111) IF NOT
> EXISTS;" >> init_data.cql
> done
>
> while true
> do
> echo "truncate the table"
> cqlsh $node1_ip -e "truncate table testdb.testtbl" > /dev/null 2>&1
> if [ $? -ne 0 ]; then
>     echo "truncating failed"
>     continue
> else
>     break
> fi
> done
>
> echo "kill C* process on node3"
> pdsh -l $node3_user -R ssh -w $node3_ip "ps auxww | grep CassandraDaemon |
> awk '{if (\$13 ~ /cassand/) print \$2}' | xargs sudo kill -9"
>
> echo "insert $rows rows"
> cqlsh $node1_ip -f init_data.cql > insert_log 2>&1
>
> echo "restart C* process on node3"
> pdsh -l $node3_user -R ssh -w $node3_ip "sudo /etc/init.d/cassandra start"
>
> while true
> do
> echo "truncate the table again"
> cqlsh $node1_ip -e "truncate table testdb.testtbl"
> if [ $? -ne 0 ]; then
>     echo "truncating failed"
>     continue
> else
>     echo "truncation succeeded!"
>     break
> fi
> done
>
> echo "kill C* process on node2"
> pdsh -l $node2_user -R ssh -w $node2_ip "ps auxww | grep CassandraDaemon |
> awk '{if (\$13 ~ /cassand/) print \$2}' | xargs sudo kill -9"
>
> cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select
> count(*) from testdb.testtbl;"
> sleep 10
> cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select
> count(*) from testdb.testtbl;"
>
> echo "restart C* process on node2"
> pdsh -l $node2_user -R ssh -w $node2_ip "sudo /etc/init.d/cassandra start"
>
>
> Thanks,
> yuji
>
>
> On Fri, Nov 18, 2016 at 7:52 PM, Yuji Ito <yuji@imagine-orb.com> wrote:
>
> I investigated the source code and the logs of the killed node.
> I guess that unexpected writes are executed while the truncation is in
> progress.
>
> Some writes were executed after the flush (the first flush) in the
> truncation, and these writes could be read afterwards.
> These writes were requested as MUTATION messages by another node for
> hinted handoff.
> Their data was stored in a new memtable and flushed (the second flush) to
> a new SSTable before the snapshot in the truncation.
> So the truncation discarded only the old SSTables, not the new SSTable.
> That's because the ReplayPosition used for discarding SSTables was that of
> the first flush.
>
> I copied some parts of the log below.
> Lines starting with "##" are my comments.
> The point is that the ReplayPosition is moved forward by the second flush,
> which means some writes were executed after the first flush.
>
> == log ==
> ## started truncation
> TRACE [SharedPool-Worker-16] 2016-11-17 08:36:04,612
> ColumnFamilyStore.java:2790 - truncating testtbl
> ## the first flush started before truncation
> DEBUG [SharedPool-Worker-16] 2016-11-17 08:36:04,612
> ColumnFamilyStore.java:952 - Enqueuing flush of testtbl: 591360 (0%)
> on-heap, 0 (0%) off-heap
> INFO  [MemtableFlushWriter:1] 2016-11-17 08:36:04,613 Memtable.java:352 -
> Writing Memtable-testtbl@1863835308(42.625KiB serialized bytes, 2816 ops,
> 0%/0% of on/off-heap limit)
> ...
> DEBUG [MemtableFlushWriter:1] 2016-11-17 08:36:04,973 Memtable.java:386 -
> Completed flushing
> /var/lib/cassandra/data/testdb/testtbl-562848f0a55611e68b1451065d58fdfb/tmp-lb-1-big-Data.db
> (17.651KiB) for commitlog position ReplayPosition(segmentId=1479371760395,
> position=315867)
> ## this ReplayPosition was used for discarding SSTables
> ...
> TRACE [MemtablePostFlush:1] 2016-11-17 08:36:05,022 CommitLog.java:298 -
> discard completed log segments for ReplayPosition(segmentId=1479371760395,
> position=315867), table 562848f0-a556-11e6-8b14-51065d58fdfb
> ## end of the first flush
> DEBUG [SharedPool-Worker-16] 2016-11-17 08:36:05,028
> ColumnFamilyStore.java:2823 - Discarding sstable data for truncated CF +
> indexes
> ## the second flush before snapshot
> DEBUG [SharedPool-Worker-16] 2016-11-17 08:36:05,028
> ColumnFamilyStore.java:952 - Enqueuing flush of testtbl: 698880 (0%)
> on-heap, 0 (0%) off-heap
> INFO  [MemtableFlushWriter:2] 2016-11-17 08:36:05,029 Memtable.java:352 -
> Writing Memtable-testtbl@1186728207(50.375KiB serialized bytes, 3328 ops,
> 0%/0% of on/off-heap limit)
> ...
> DEBUG [MemtableFlushWriter:2] 2016-11-17 08:36:05,258 Memtable.java:386 -
> Completed flushing
> /var/lib/cassandra/data/testdb/testtbl-562848f0a55611e68b1451065d58fdfb/tmp-lb-2-big-Data.db
> (17.696KiB) for commitlog position ReplayPosition(segmentId=1479371760395,
> position=486627)
> ...
> TRACE [MemtablePostFlush:1] 2016-11-17 08:36:05,289 CommitLog.java:298 -
> discard completed log segments for ReplayPosition(segmentId=1479371760395,
> position=486627), table 562848f0-a556-11e6-8b14-51065d58fdfb
> ## end of the second flush: position was moved
> ...
> ## only the old SSTable was deleted because it was older than
> ReplayPosition(segmentId=1479371760395, position=315867)
> TRACE [NonPeriodicTasks:1] 2016-11-17 08:36:05,303 SSTable.java:118 -
> Deleted
> /var/lib/cassandra/data/testdb/testtbl-562848f0a55611e68b1451065d58fdfb/lb-1-big
> ...
> TRACE [SharedPool-Worker-16] 2016-11-17 08:36:05,320
> ColumnFamilyStore.java:2841 - truncate complete
> TRACE [SharedPool-Worker-16] 2016-11-17 08:36:05,320
> TruncateVerbHandler.java:53 - Truncation(keyspace='testdb', cf='testtbl')
> applied.  Enqueuing response to 36512@/10.91.145.7
> TRACE [SharedPool-Worker-16] 2016-11-17 08:36:05,320
> MessagingService.java:728 - /10.91.145.27 sending REQUEST_RESPONSE to
> 36512@/10.91.145.7
> ## end of truncation
> ====
>
> Actually, "truncated_at" for this table in system.local after running
> the script was 0x00000158716da30b0004d1db00000158716db524.
> It means segmentId=1479371760395, position=315867, and
> truncated_at=1479371765028 (2016-11-17 08:36:05,028).
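>
> (For reference, the blob decodes as 8 bytes of segmentId, 4 bytes of
> position, and 8 bytes of truncation time in ms - a quick sketch, assuming
> that layout, which matches the numbers above:)
>
> printf '%d\n' 0x00000158716da30b   # segmentId    -> 1479371760395
> printf '%d\n' 0x0004d1db           # position     -> 315867
> printf '%d\n' 0x00000158716db524   # truncated_at -> 1479371765028 (ms)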
>
> thanks,
> yuji
>
>
> On Wed, Nov 16, 2016 at 5:25 PM, Yuji Ito <yuji@imagine-orb.com> wrote:
>
> Hi,
>
> I found stale data after truncating a table.
> It seems that the truncation starts while recovery is still being executed
> just after a node restarts.
> Does recovery still continue after the truncation finishes?
> Is that expected?
>
> I use C* 2.2.8 and can reproduce it as below.
>
> ==== [create table] ====
> cqlsh $ip -e "drop keyspace testdb;"
> cqlsh $ip -e "CREATE KEYSPACE testdb WITH replication = {'class':
> 'SimpleStrategy', 'replication_factor': '3'};"
> cqlsh $ip -e "CREATE TABLE testdb.testtbl (key int PRIMARY KEY, val int);"
>
> ==== [script] ====
> #!/bin/sh
>
> node1_ip=<node1 IP address>
> node2_ip=<node2 IP address>
> node3_ip=<node3 IP address>
> node3_user=<user name>
> rows=10000
>
> echo "consistency quorum;" > init_data.cql
> for key in $(seq 0 $(expr $rows - 1))
> do
>     echo "insert into testdb.testtbl (key, val) values($key, 1111) IF NOT
> EXISTS;" >> init_data.cql
> done
>
> while true
> do
> echo "truncate the table"
> cqlsh $node1_ip -e "truncate table testdb.testtbl"
> if [ $? -ne 0 ]; then
>     echo "truncating failed"
>     continue
> else
>     break
> fi
> done
>
> echo "kill C* process on node3"
> pdsh -l $node3_user -R ssh -w $node3_ip "ps auxww | grep CassandraDaemon |
> awk '{if (\$13 ~ /cassand/) print \$2}' | xargs sudo kill -9"
>
> echo "insert $rows rows"
> cqlsh $node1_ip -f init_data.cql > insert_log 2>&1
>
> echo "restart C* process on node3"
> pdsh -l $node3_user -R ssh -w $node3_ip "sudo /etc/init.d/cassandra start"
>
> while true
> do
> echo "truncate the table again"
> cqlsh $node1_ip -e "truncate table testdb.testtbl"
> if [ $? -ne 0 ]; then
>     echo "truncating failed"
>     continue
> else
>     break
> fi
> done
>
> cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select
> count(*) from testdb.testtbl;"
> sleep 10
> cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select
> count(*) from testdb.testtbl;"
>
>
> ==== [result] ====
> truncate the table
> kill C* process on node3
> insert 10000 rows
> restart C* process on node3
> 10.91.145.27: Starting Cassandra: OK
> truncate the table again
> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency
> level ALL
> truncating failed
> truncate the table again
> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency
> level ALL
> truncating failed
> truncate the table again
> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency
> level ALL
> truncating failed
> truncate the table again
> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency
> level ALL
> truncating failed
> truncate the table again
> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency
> level ALL
> truncating failed
> truncate the table again
> <stdin>:1:TruncateError: Error during truncate: Cannot achieve consistency
> level ALL
> truncating failed
> truncate the table again
> Consistency level set to SERIAL.
>
>  count
> -------
>    300
>
> (1 rows)
>
> Warnings :
> Aggregation query used without partition key
>
> Consistency level set to SERIAL.
>
>  count
> -------
>   2304
>
> (1 rows)
>
> Warnings :
> Aggregation query used without partition key
> ====
>
> I found it when I was investigating a data loss problem (see the "failure
> node rejoin" thread).
> I'm not sure whether this problem is related to that data loss.
>
> Thanks,
> yuji
>
>
>
>
