cassandra-user mailing list archives

From Hannu Kröger <hkro...@gmail.com>
Subject Re: Range deletes, wide partitions, and reverse iterators
Date Tue, 16 May 2017 18:05:04 GMT
Yes, I agree. I would say it cannot skip those cells because it doesn’t check the max timestamp
of the cells of the sstable and therefore scans them one by one.

Hannu
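
To make the point above concrete, here is a minimal Python sketch of the kind of check being discussed. It is a toy model, not Cassandra's read path: `SSTable`, `RangeTombstone`, and `read_live` are hypothetical names, clustering keys and timestamps are plain integers, and `max_timestamp` stands in for the per-sstable timestamp statistics mentioned in the thread. The covered slice is only skipped when every cell in the sstable is older than the deletion; otherwise each covered cell is compared one by one.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Tuple

Cell = Tuple[int, int]  # (clustering key, write timestamp)


@dataclass
class RangeTombstone:
    start: int      # inclusive clustering bound
    end: int        # exclusive clustering bound
    timestamp: int  # write time of the deletion


@dataclass
class SSTable:
    cells: List[Cell]  # sorted by clustering key

    @property
    def max_timestamp(self) -> int:
        # Stand-in for a per-sstable timestamp statistic (assumption).
        return max((ts for _, ts in self.cells), default=0)


def read_live(sstable: SSTable, rt: RangeTombstone) -> Tuple[List[Cell], int]:
    """Return (live cells, number of covered cells compared one by one)."""
    keys = [key for key, _ in sstable.cells]
    lo = bisect_left(keys, rt.start)  # first cell inside the deleted range
    hi = bisect_left(keys, rt.end)    # first cell past the deleted range
    if sstable.max_timestamp < rt.timestamp:
        # Every cell predates the deletion, so the covered slice is
        # entirely shadowed and can be jumped over without comparisons.
        return sstable.cells[:lo] + sstable.cells[hi:], 0
    # Some cell may be newer than the deletion: check each covered cell.
    covered = sstable.cells[lo:hi]
    survivors = [cell for cell in covered if cell[1] > rt.timestamp]
    return sstable.cells[:lo] + survivors + sstable.cells[hi:], len(covered)
```

With an sstable whose cells all predate the tombstone, `read_live` reports zero per-cell comparisons; as soon as one cell is newer, the whole covered slice has to be checked.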
 
> On 16 May 2017, at 19:48, Stefano Ortolani <ostefano@gmail.com> wrote:
> 
> But it should skip those records since they are sorted. My understanding would be something like:
> 
> 1) read sstable 2
> 2) read the range tombstone
> 3) skip records from sstable2 and sstable1 within the range boundaries
> 4) read remaining records from sstable1
> 5) no records, return
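
The five steps above can be sketched as a toy merge in Python. This illustrates the expectation only, not what Cassandra actually does: `read_partition` is a hypothetical helper, each sstable is modelled as a sorted list of integer clustering keys, and the range tombstone is assumed to be newer than every row (as in the scenario described in this thread), so timestamps are left out.

```python
import heapq
from bisect import bisect_left
from typing import List, Tuple


def read_partition(sstables: List[List[int]], start: int, end: int) -> Tuple[List[int], int]:
    """Merge sorted sstables, seeking past the deleted range [start, end).

    Returns (live keys in clustering order, number of rows skipped).
    Assumes the tombstone shadows every row inside the range.
    """
    sources = []
    skipped = 0
    for keys in sstables:
        lo = bisect_left(keys, start)  # first key inside the deleted range
        hi = bisect_left(keys, end)    # first key past the deleted range
        skipped += hi - lo             # rows jumped over without being read
        sources.append(keys[:lo] + keys[hi:])
    # The remaining rows from every sstable are merged in clustering order.
    return list(heapq.merge(*sources)), skipped
```

For example, `read_partition([[0, 1, 2]], 0, 3)` returns no live rows and reports three rows skipped, which corresponds to step 5.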
> 
> On Tue, May 16, 2017 at 5:43 PM, Hannu Kröger <hkroger@gmail.com> wrote:
> This is a bit of a guess, but it probably reads sstables in some sort of sequence, so even if sstable 2 contains the tombstone, it still scans through sstable 1 for possible data to be read.
> 
> BR,
> Hannu
> 
>> On 16 May 2017, at 19:40, Stefano Ortolani <ostefano@gmail.com> wrote:
>> 
>> Little update: the following query also times out, which is weird since the range tombstone should have been read by then...
>> 
>> SELECT * 
>> FROM test_cql.test_cf 
>> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf 
>> AND timeid < the_oldest_deleted_timeid
>> ORDER BY timeid DESC;
>> 
>> 
>> 
>> On Tue, May 16, 2017 at 5:17 PM, Stefano Ortolani <ostefano@gmail.com> wrote:
>> Yes, that was my intention, but I wanted to cross-check with the ML and the devs keeping an eye on it first.
>> 
>> On Tue, May 16, 2017 at 5:10 PM, Hannu Kröger <hkroger@gmail.com> wrote:
>> Well,
>> 
>> sstables contain some statistics about the cell timestamps, and using that information together with the tombstone timestamp it might be possible to skip some data, but I’m not sure that Cassandra currently does that. Maybe it would be worth opening a JIRA ticket to see what the devs think about it and whether optimizing this case would make sense.
>> 
>> Hannu
>> 
>>> On 16 May 2017, at 18:03, Stefano Ortolani <ostefano@gmail.com> wrote:
>>> 
>>> Hi Hannu,
>>> 
>>> the piece of data in question is older. In my example the tombstone is the newest piece of data.
>>> Since a range tombstone has information about the clustering key ranges, and the data is sorted by clustering key, I would expect a linear scan not to be necessary.
>>> 
>>> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger <hkroger@gmail.com> wrote:
>>> Well, as mentioned, Cassandra probably doesn’t have the logic and data to skip bigger regions of deleted data based on a range tombstone. If some piece of data in a partition is newer than the tombstone, then it cannot be skipped. Therefore some partition-level statistics of cell ages would need to be kept in the column index for the skipping, and that is probably not there.
>>> 
>>> Hannu 
>>> 
>>>> On 16 May 2017, at 17:33, Stefano Ortolani <ostefano@gmail.com> wrote:
>>>> 
>>>> That is another way to see the question: are reverse iterators range tombstone aware? Yes.
>>>> That is why I am puzzled by the aforementioned behavior.
>>>> I would expect them to handle this case more gracefully.
>>>> 
>>>> Cheers,
>>>> Stefano
>>>> 
>>>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth <nitan@bamlabs.com> wrote:
>>>> Hannu,
>>>> 
>>>> How can you read a partition in reverse?
>>>> 
>>>> Sent from my iPhone
>>>> 
>>>> > On May 16, 2017, at 9:20 AM, Hannu Kröger <hkroger@gmail.com> wrote:
>>>> >
>>>> > Well, I’m guessing that Cassandra doesn't really know if the range tombstone is useful for this or not.
>>>> >
>>>> > In many cases it might be that the partition contains data that is within the range of the tombstone but is newer than the tombstone, and therefore it might still be returned. Scanning through deleted data can be avoided by reading the partition in reverse (if all the deleted data is at the beginning of the partition). Eventually you will still end up reading a lot of tombstones, but you will get a lot of live data first, and the implicit query limit of 10000 is probably reached before you get to the tombstones. Therefore you will get an immediate answer.
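
A rough Python sketch of why the read direction matters under a row limit. It is a deliberately naive model in which every deleted cell is examined one by one (the behaviour suspected in this thread, not a confirmed description of Cassandra internals); `scan`, `deleted_before`, and the integer keys are hypothetical names and values chosen for illustration.

```python
from typing import List, Tuple


def scan(keys: List[int], deleted_before: int, limit: int,
         reverse: bool = False) -> Tuple[List[int], int]:
    """Return (rows returned, cells examined) for a limited scan.

    `keys` is sorted ascending; every key below `deleted_before` is
    treated as shadowed by the range tombstone (assumption).
    """
    rows: List[int] = []
    examined = 0
    for key in (reversed(keys) if reverse else keys):
        if len(rows) >= limit:
            break                   # limit reached, stop reading
        examined += 1               # each cell costs one read, live or not
        if key >= deleted_before:
            rows.append(key)        # live row
    return rows, examined
```

With 1000 keys, the older half deleted, and a limit of 100, the forward scan examines 600 cells before returning, while the reverse scan examines only 100 and never touches the deleted half.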
>>>> >
>>>> > Does it make sense?
>>>> >
>>>> > Hannu
>>>> >
>>>> >> On 16 May 2017, at 16:33, Stefano Ortolani <ostefano@gmail.com> wrote:
>>>> >>
>>>> >> Hi all,
>>>> >>
>>>> >> I am seeing inconsistencies when mixing range tombstones, wide partitions, and reverse iterators.
>>>> >> I still have to understand whether the behaviour is to be expected, hence the message on the mailing list.
>>>> >>
>>>> >> The situation is conceptually simple. I am using a table defined as follows:
>>>> >>
>>>> >> CREATE TABLE test_cql.test_cf (
>>>> >>  hash blob,
>>>> >>  timeid timeuuid,
>>>> >>  PRIMARY KEY (hash, timeid)
>>>> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
>>>> >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
>>>> >>
>>>> >> I then proceed by loading 2/3GB from 3 sstables which I know contain a really wide partition (> 512 MB) for `hash = x`. I then delete the oldest _half_ of that partition by executing the query below, and restart the node:
>>>> >>
>>>> >> DELETE
>>>> >> FROM test_cql.test_cf
>>>> >> WHERE hash = x AND timeid < y;
>>>> >>
>>>> >> If I keep compactions disabled, the following query times out (takes more than 10 seconds to succeed):
>>>> >>
>>>> >> SELECT *
>>>> >> FROM test_cql.test_cf
>>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>>> >> ORDER BY timeid ASC;
>>>> >>
>>>> >> While the following returns immediately (obviously because no deleted data is ever read):
>>>> >>
>>>> >> SELECT *
>>>> >> FROM test_cql.test_cf
>>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>>> >> ORDER BY timeid DESC;
>>>> >>
>>>> >> If I force a compaction the problem is gone, but I presume just because the data is rearranged.
>>>> >>
>>>> >> It seems to me that reading by ASC does not make use of the range tombstone until C* reads the last sstable (which actually contains the range tombstone and is flushed at node restart), and it wastes time reading all rows that are actually not live anymore.
>>>> >>
>>>> >> Is this expected? Should the range tombstone actually help in these cases?
>>>> >>
>>>> >> Thanks a lot!
>>>> >> Stefano
>>>> >
>>>> >
>>>> >
>>>> 
>>> 
>>> 
>> 
>> 
>> 
> 
> 

