cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yuji Ito <y...@imagine-orb.com>
Subject Re: Does recovery continue after truncating a table?
Date Mon, 28 Nov 2016 01:08:40 GMT
Thanks Ben and Hiro,

I've reported it at https://issues.apache.org/jira/browse/CASSANDRA-12960.

I'll use `truncatehints` or DROP command after this.


On Sun, Nov 27, 2016 at 12:33 PM, Ben Slater <ben.slater@instaclustr.com>
wrote:

> By “undocumented limitation”, I meant “TRUNCATE” is mainly only used in
> development and testing, not production scenarios so a sufficient fix (and
> certainly a better than nothing fix) might be just to document that if you
> issue a TRUNCATE while there are still hinted hand-offs pending the hinted
> hand-offs replayed after the truncate will come back to life. Of course, an
> actual fix would be better.
>
> Cheers
> Ben
>
> On Sat, 26 Nov 2016 at 21:08 Hiroyuki Yamada <mogwaing@gmail.com> wrote:
>
>> Hi Yuji and Ben,
>>
>> I tried out this revised script and the same issue occurred to me, too.
>> I think it's definitely a bug to be solved asap.
>>
>> >Ben
>> What do you mean "an undocumented limitation" ?
>>
>> Thanks,
>> Hiro
>>
>> On Sat, Nov 26, 2016 at 3:13 PM, Ben Slater <ben.slater@instaclustr.com>
>> wrote:
>> > Nice detective work! Seems to me that it’s a best an undocumented
>> limitation
>> > and potentially could be viewed as a bug - maybe log another JIRA?
>> >
>> > One node - there is a nodetool truncatehints command that could be used
>> to
>> > clear out the hints
>> > (http://cassandra.apache.org/doc/latest/tools/nodetool/
>> truncatehints.html?highlight=truncate)
>> > . However, it seems to clear all hints on particular endpoint, not just
>> for
>> > a specific table.
>> >
>> > Cheers
>> > Ben
>> >
>> > On Fri, 25 Nov 2016 at 17:42 Yuji Ito <yuji@imagine-orb.com> wrote:
>> >>
>> >> Hi all,
>> >>
>> >> I revised the script to reproduce the issue.
>> >> I think the issue happens more frequently than before.
>> >> Killing another node is added to the previous script.
>> >>
>> >> ==== [script] ====
>> >> #!/bin/sh
>> >>
>> >> node1_ip=<node1 IP address>
>> >> node2_ip=<node2 IP address>
>> >> node3_ip=<node3 IP address>
>> >> node2_user=<user name>
>> >> node3_user=<user name>
>> >> rows=10000
>> >>
>> >> echo "consistency quorum;" > init_data.cql
>> >> for key in $(seq 0 $(expr $rows - 1))
>> >> do
>> >>     echo "insert into testdb.testtbl (key, val) values($key, 1111) IF
>> NOT
>> >> EXISTS;" >> init_data.cql
>> >>     done
>> >>
>> >>     while true
>> >>     do
>> >>     echo "truncate the table"
>> >>     cqlsh $node1_ip -e "truncate table testdb.testtbl" > /dev/null 2>&1
>> >>     if [ $? -ne 0 ]; then
>> >>         echo "truncating failed"
>> >>     continue
>> >>     else
>> >>         break
>> >>     fi
>> >> done
>> >>
>> >> echo "kill C* process on node3"
>> >> pdsh -l $node3_user -R ssh -w $node3_ip "ps auxww | grep
>> CassandraDaemon |
>> >> awk '{if (\$13 ~ /cassand/) print \$2}' | xargs sudo kill -9"
>> >>
>> >> echo "insert $rows rows"
>> >> cqlsh $node1_ip -f init_data.cql > insert_log 2>&1
>> >>
>> >> echo "restart C* process on node3"
>> >> pdsh -l $node3_user -R ssh -w $node3_ip "sudo /etc/init.d/cassandra
>> start"
>> >>
>> >> while true
>> >> do
>> >> echo "truncate the table again"
>> >> cqlsh $node1_ip -e "truncate table testdb.testtbl"
>> >> if [ $? -ne 0 ]; then
>> >>     echo "truncating failed"
>> >>         continue
>> >> else
>> >>     echo "truncation succeeded!"
>> >>     break
>> >> fi
>> >> done
>> >>
>> >> echo "kill C* process on node2"
>> >> pdsh -l $node2_user -R ssh -w $node2_ip "ps auxww | grep
>> CassandraDaemon |
>> >> awk '{if (\$13 ~ /cassand/) print \$2}' | xargs sudo kill -9"
>> >>
>> >> cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select
>> >> count(*) from testdb.testtbl;"
>> >> sleep 10
>> >> cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select
>> >> count(*) from testdb.testtbl;"
>> >>
>> >> echo "restart C* process on node2"
>> >> pdsh -l $node2_user -R ssh -w $node2_ip "sudo /etc/init.d/cassandra
>> start"
>> >>
>> >>
>> >> Thanks,
>> >> yuji
>> >>
>> >>
>> >> On Fri, Nov 18, 2016 at 7:52 PM, Yuji Ito <yuji@imagine-orb.com>
>> wrote:
>> >>>
>> >>> I investigated source code and logs of killed node.
>> >>> I guess that unexpected writes are executed when truncation is being
>> >>> executed.
>> >>>
>> >>> Some writes were executed after flush (the first flush) in truncation
>> and
>> >>> these writes could be read.
>> >>> These writes were requested as MUTATION by another node for hinted
>> >>> handoff.
>> >>> Their data was stored to a new memtable and flushed (the second
>> flush) to
>> >>> a new SSTable before snapshot in truncation.
>> >>> So, the truncation discarded only old SSTables, not the new SSTable.
>> >>> That's because ReplayPosition which was used for discarding SSTable
>> was
>> >>> that of the first flush.
>> >>>
>> >>> I copied some parts of log as below.
>> >>> "##" line is my comment.
>> >>> The point is that the ReplayPosition is moved forward by the second
>> >>> flush.
>> >>> It means some writes are executed after the first flush.
>> >>>
>> >>> == log ==
>> >>> ## started truncation
>> >>> TRACE [SharedPool-Worker-16] 2016-11-17 08:36:04,612
>> >>> ColumnFamilyStore.java:2790 - truncating testtbl
>> >>> ## the first flush started before truncation
>> >>> DEBUG [SharedPool-Worker-16] 2016-11-17 08:36:04,612
>> >>> ColumnFamilyStore.java:952 - Enqueuing flush of testtbl: 591360 (0%)
>> >>> on-heap, 0 (0%) off-heap
>> >>> INFO  [MemtableFlushWriter:1] 2016-11-17 08:36:04,613
>> Memtable.java:352 -
>> >>> Writing Memtable-testtbl@1863835308(42.625KiB serialized bytes, 2816
>> ops,
>> >>> 0%/0% of on/off-heap limit)
>> >>> ...
>> >>> DEBUG [MemtableFlushWriter:1] 2016-11-17 08:36:04,973
>> Memtable.java:386 -
>> >>> Completed flushing
>> >>> /var/lib/cassandra/data/testdb/testtbl-562848f0a55611e68b1451065d58fd
>> fb/tmp-lb-1-big-Data.db
>> >>> (17.651KiB) for commitlog position ReplayPosition(segmentId=
>> 1479371760395,
>> >>> position=315867)
>> >>> ## this ReplayPosition was used for discarding SSTables
>> >>> ...
>> >>> TRACE [MemtablePostFlush:1] 2016-11-17 08:36:05,022
>> CommitLog.java:298 -
>> >>> discard completed log segments for ReplayPosition(segmentId=
>> 1479371760395,
>> >>> position=315867), table 562848f0-a556-11e6-8b14-51065d58fdfb
>> >>> ## end of the first flush
>> >>> DEBUG [SharedPool-Worker-16] 2016-11-17 08:36:05,028
>> >>> ColumnFamilyStore.java:2823 - Discarding sstable data for truncated
>> CF +
>> >>> indexes
>> >>> ## the second flush before snapshot
>> >>> DEBUG [SharedPool-Worker-16] 2016-11-17 08:36:05,028
>> >>> ColumnFamilyStore.java:952 - Enqueuing flush of testtbl: 698880 (0%)
>> >>> on-heap, 0 (0%) off-heap
>> >>> INFO  [MemtableFlushWriter:2] 2016-11-17 08:36:05,029
>> Memtable.java:352 -
>> >>> Writing Memtable-testtbl@1186728207(50.375KiB serialized bytes, 3328
>> ops,
>> >>> 0%/0% of on/off-heap limit)
>> >>> ...
>> >>> DEBUG [MemtableFlushWriter:2] 2016-11-17 08:36:05,258
>> Memtable.java:386 -
>> >>> Completed flushing
>> >>> /var/lib/cassandra/data/testdb/testtbl-562848f0a55611e68b1451065d58fd
>> fb/tmp-lb-2-big-Data.db
>> >>> (17.696KiB) for commitlog position ReplayPosition(segmentId=
>> 1479371760395,
>> >>> position=486627)
>> >>> ...
>> >>> TRACE [MemtablePostFlush:1] 2016-11-17 08:36:05,289
>> CommitLog.java:298 -
>> >>> discard completed log segments for ReplayPosition(segmentId=
>> 1479371760395,
>> >>> position=486627), table 562848f0-a556-11e6-8b14-51065d58fdfb
>> >>> ## end of the second flush: position was moved
>> >>> ...
>> >>> ## only old SSTable was deleted because this SSTable was older than
>> >>> ReplayPosition(segmentId=1479371760395, position=315867)
>> >>> TRACE [NonPeriodicTasks:1] 2016-11-17 08:36:05,303 SSTable.java:118
-
>> >>> Deleted
>> >>> /var/lib/cassandra/data/testdb/testtbl-562848f0a55611e68b1451065d58fd
>> fb/lb-1-big
>> >>> ...
>> >>> TRACE [SharedPool-Worker-16] 2016-11-17 08:36:05,320
>> >>> ColumnFamilyStore.java:2841 - truncate complete
>> >>> TRACE [SharedPool-Worker-16] 2016-11-17 08:36:05,320
>> >>> TruncateVerbHandler.java:53 - Truncation(keyspace='testdb',
>> cf='testtbl')
>> >>> applied.  Enqueuing response to 36512@/10.91.145.7
>> >>> TRACE [SharedPool-Worker-16] 2016-11-17 08:36:05,320
>> >>> MessagingService.java:728 - /10.91.145.27 sending REQUEST_RESPONSE to
>> >>> 36512@/10.91.145.7
>> >>> ## end of truncation
>> >>> ====
>> >>>
>> >>> Actually, "truncated_at" of the table on the system.local after
>> running
>> >>> the script was 0x00000158716da30b0004d1db00000158716db524.
>> >>> It means segmentId=1479371760395, position=315867
>> >>> truncated_at=1479371765028 (2016-11-17 08:36:05,028)
>> >>>
>> >>> thanks,
>> >>> yuji
>> >>>
>> >>>
>> >>> On Wed, Nov 16, 2016 at 5:25 PM, Yuji Ito <yuji@imagine-orb.com>
>> wrote:
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> I could find stale data after truncating a table.
>> >>>> It seems that truncating starts while recovery is being executed
just
>> >>>> after a node restarts.
>> >>>> After the truncating finishes, recovery still continues?
>> >>>> Is it expected?
>> >>>>
>> >>>> I use C* 2.2.8 and can reproduce it as below.
>> >>>>
>> >>>> ==== [create table] ====
>> >>>> cqlsh $ip -e "drop keyspace testdb;"
>> >>>> cqlsh $ip -e "CREATE KEYSPACE testdb WITH replication = {'class':
>> >>>> 'SimpleStrategy', 'replication_factor': '3'};"
>> >>>> cqlsh $ip -e "CREATE TABLE testdb.testtbl (key int PRIMARY KEY,
val
>> >>>> int);"
>> >>>>
>> >>>> ==== [script] ====
>> >>>> #!/bin/sh
>> >>>>
>> >>>> node1_ip=<node1 IP address>
>> >>>> node2_ip=<node2 IP address>
>> >>>> node3_ip=<node3 IP address>
>> >>>> node3_user=<user name>
>> >>>> rows=10000
>> >>>>
>> >>>> echo "consistency quorum;" > init_data.cql
>> >>>> for key in $(seq 0 $(expr $rows - 1))
>> >>>> do
>> >>>>     echo "insert into testdb.testtbl (key, val) values($key, 1111)
IF
>> >>>> NOT EXISTS;" >> init_data.cql
>> >>>> done
>> >>>>
>> >>>> while true
>> >>>> do
>> >>>> echo "truncate the table"
>> >>>> cqlsh $node1_ip -e "truncate table testdb.testtbl"
>> >>>> if [ $? -ne 0 ]; then
>> >>>>     echo "truncating failed"
>> >>>>     continue
>> >>>> else
>> >>>>     break
>> >>>> fi
>> >>>> done
>> >>>>
>> >>>> echo "kill C* process on node3"
>> >>>> pdsh -l $node3_user -R ssh -w $node3_ip "ps auxww | grep
>> CassandraDaemon
>> >>>> | awk '{if (\$13 ~ /cassand/) print \$2}' | xargs sudo kill -9"
>> >>>>
>> >>>> echo "insert $rows rows"
>> >>>> cqlsh $node1_ip -f init_data.cql > insert_log 2>&1
>> >>>>
>> >>>> echo "restart C* process on node3"
>> >>>> pdsh -l $node3_user -R ssh -w $node3_ip "sudo /etc/init.d/cassandra
>> >>>> start"
>> >>>>
>> >>>> while true
>> >>>> do
>> >>>> echo "truncate the table again"
>> >>>> cqlsh $node1_ip -e "truncate table testdb.testtbl"
>> >>>> if [ $? -ne 0 ]; then
>> >>>>     echo "truncating failed"
>> >>>>     continue
>> >>>> else
>> >>>>     break
>> >>>> fi
>> >>>> done
>> >>>>
>> >>>> cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select
>> >>>> count(*) from testdb.testtbl;"
>> >>>> sleep 10
>> >>>> cqlsh $node1_ip --request-timeout 3600 -e "consistency serial; select
>> >>>> count(*) from testdb.testtbl;"
>> >>>>
>> >>>>
>> >>>> ==== [result] ====
>> >>>> truncate the table
>> >>>> kill C* process on node3
>> >>>> insert 10000 rows
>> >>>> restart C* process on node3
>> >>>> 10.91.145.27: Starting Cassandra: OK
>> >>>> truncate the table again
>> >>>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve
>> >>>> consistency level ALL
>> >>>> truncating failed
>> >>>> truncate the table again
>> >>>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve
>> >>>> consistency level ALL
>> >>>> truncating failed
>> >>>> truncate the table again
>> >>>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve
>> >>>> consistency level ALL
>> >>>> truncating failed
>> >>>> truncate the table again
>> >>>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve
>> >>>> consistency level ALL
>> >>>> truncating failed
>> >>>> truncate the table again
>> >>>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve
>> >>>> consistency level ALL
>> >>>> truncating failed
>> >>>> truncate the table again
>> >>>> <stdin>:1:TruncateError: Error during truncate: Cannot achieve
>> >>>> consistency level ALL
>> >>>> truncating failed
>> >>>> truncate the table again
>> >>>> Consistency level set to SERIAL.
>> >>>>
>> >>>>  count
>> >>>> -------
>> >>>>    300
>> >>>>
>> >>>> (1 rows)
>> >>>>
>> >>>> Warnings :
>> >>>> Aggregation query used without partition key
>> >>>>
>> >>>> Consistency level set to SERIAL.
>> >>>>
>> >>>>  count
>> >>>> -------
>> >>>>   2304
>> >>>>
>> >>>> (1 rows)
>> >>>>
>> >>>> Warnings :
>> >>>> Aggregation query used without partition key
>> >>>> ====
>> >>>>
>> >>>> I found it when I was investigating data lost problem. (Ref. "failure
>> >>>> node rejoin" thread)
>> >>>> I'm not sure this problem is related to data lost.
>> >>>>
>> >>>> Thanks,
>> >>>> yuji
>> >>>
>> >>>
>> >>
>> >
>>
>

Mime
View raw message