From: Stefano Ortolani
Date: Tue, 16 May 2017 17:48:35 +0100
Subject: Re: Range deletes, wide partitions, and reverse iterators
To: Hannu Kröger
Cc: Nitan Kainth, user@cassandra.apache.org
Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm

But it should skip those records since they are sorted. My understanding
would be something like:

1) read sstable 2
2) read the range tombstone
3) skip records from sstable2 and sstable1 within the range boundaries
4) read remaining records from sstable1
5) no records, return

On Tue, May 16, 2017 at 5:43 PM, Hannu Kröger wrote:

> This is a bit of guessing but it probably reads sstables in some sort of
> sequence, so even if sstable 2 contains the tombstone, it still scans
> through the sstable 1 for possible data to be read.
>
> BR,
> Hannu
>
> On 16 May 2017, at 19:40, Stefano Ortolani wrote:
>
> Little update: the following query also times out, which is weird since
> the range tombstone should have been read by then...
>
> SELECT *
> FROM test_cql.test_cf
> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
> AND timeid < the_oldest_deleted_timeid
> ORDER BY timeid DESC;
>
> On Tue, May 16, 2017 at 5:17 PM, Stefano Ortolani wrote:
>
>> Yes, that was my intention but I wanted to cross-check with the ML and
>> the devs keeping an eye on it first.
>>
>> On Tue, May 16, 2017 at 5:10 PM, Hannu Kröger wrote:
>>
>>> Well,
>>>
>>> sstables contain some statistics about the cell timestamps and using
>>> that information and the tombstone timestamp it might be possible to
>>> skip some data but I’m not sure that Cassandra currently does that.
>>> Maybe it would be worth filing a JIRA ticket to see what the devs think
>>> about it, and whether optimizing this case would make sense.
>>>
>>> Hannu
>>>
>>> On 16 May 2017, at 18:03, Stefano Ortolani wrote:
>>>
>>> Hi Hannu,
>>>
>>> the piece of data in question is older. In my example the tombstone is
>>> the newest piece of data.
>>> Since a range tombstone has information re the clustering key ranges,
>>> and the data is clustering key sorted, I would expect a linear scan not
>>> to be necessary.
>>>
>>> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger wrote:
>>>
>>>> Well, as mentioned, probably Cassandra doesn’t have logic and data to
>>>> skip bigger regions of deleted data based on range tombstone. If some
>>>> piece of data in a partition is newer than the tombstone, then it
>>>> cannot be skipped. Therefore some partition level statistics of cell
>>>> ages would need to be kept in the column index for the skipping and
>>>> that is probably not there.
>>>>
>>>> Hannu
>>>>
>>>> On 16 May 2017, at 17:33, Stefano Ortolani wrote:
>>>>
>>>> That is another way to see the question: are reverse iterators range
>>>> tombstone aware? Yes.
>>>> That is why I am puzzled by this afore-mentioned behavior.
>>>> I would expect them to handle this case more gracefully.
>>>>
>>>> Cheers,
>>>> Stefano
>>>>
>>>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth wrote:
>>>>
>>>>> Hannu,
>>>>>
>>>>> How can you read a partition in reverse?
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> > On May 16, 2017, at 9:20 AM, Hannu Kröger wrote:
>>>>> >
>>>>> > Well, I’m guessing that Cassandra doesn't really know if the range
>>>>> > tombstone is useful for this or not.
>>>>> >
>>>>> > In many cases it might be that the partition contains data that is
>>>>> > within the range of the tombstone but is newer than the tombstone
>>>>> > and therefore it might still be returned. Scanning through deleted
>>>>> > data can be avoided by reading the partition in reverse (if all the
>>>>> > deleted data is in the beginning of the partition). Eventually you
>>>>> > will still end up reading a lot of tombstones but you will get a lot
>>>>> > of live data first and the implicit query limit of 10000 probably is
>>>>> > reached before you get to the tombstones. Therefore you will get an
>>>>> > immediate answer.
>>>>> >
>>>>> > Does it make sense?
>>>>> >
>>>>> > Hannu
>>>>> >
>>>>> >> On 16 May 2017, at 16:33, Stefano Ortolani wrote:
>>>>> >>
>>>>> >> Hi all,
>>>>> >>
>>>>> >> I am seeing inconsistencies when mixing range tombstones, wide
>>>>> >> partitions, and reverse iterators.
>>>>> >> I still have to understand if the behaviour is to be expected,
>>>>> >> hence the message on the mailing list.
>>>>> >>
>>>>> >> The situation is conceptually simple. I am using a table defined as
>>>>> >> follows:
>>>>> >>
>>>>> >> CREATE TABLE test_cql.test_cf (
>>>>> >>   hash blob,
>>>>> >>   timeid timeuuid,
>>>>> >>   PRIMARY KEY (hash, timeid)
>>>>> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
>>>>> >>   AND compaction = {'class' : 'LeveledCompactionStrategy'};
>>>>> >>
>>>>> >> I then proceed by loading 2/3GB from 3 sstables which I know
>>>>> >> contain a really wide partition (> 512 MB) for `hash = x`. I then
>>>>> >> delete the oldest _half_ of that partition by executing the query
>>>>> >> below, and restart the node:
>>>>> >>
>>>>> >> DELETE
>>>>> >> FROM test_cql.test_cf
>>>>> >> WHERE hash = x AND timeid < y;
>>>>> >>
>>>>> >> If I keep compactions disabled the following query times out (takes
>>>>> >> more than 10 seconds to succeed):
>>>>> >>
>>>>> >> SELECT *
>>>>> >> FROM test_cql.test_cf
>>>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>>>> >> ORDER BY timeid ASC;
>>>>> >>
>>>>> >> While the following returns immediately (obviously because no
>>>>> >> deleted data is ever read):
>>>>> >>
>>>>> >> SELECT *
>>>>> >> FROM test_cql.test_cf
>>>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>>>> >> ORDER BY timeid DESC;
>>>>> >>
>>>>> >> If I force a compaction the problem is gone, but I presume just
>>>>> >> because the data is rearranged.
>>>>> >>
>>>>> >> It seems to me that reading by ASC does not make use of the range
>>>>> >> tombstone until C* reads the last sstable (which actually contains
>>>>> >> the range tombstone and is flushed at node restart), and it wastes
>>>>> >> time reading all rows that are actually not live anymore.
>>>>> >>
>>>>> >> Is this expected? Should the range tombstone actually help in these
>>>>> >> cases?
>>>>> >>
>>>>> >> Thanks a lot!
>>>>> >> Stefano
>>>>> >
>>>>> > ---------------------------------------------------------------------
>>>>> > To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
>>>>> > For additional commands, e-mail: user-help@cassandra.apache.org
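
For illustration only, here is a minimal Python sketch of the five-step read path proposed at the top of this message. It is a toy model under stated assumptions, not Cassandra's actual read path: `SSTable`, `RangeTombstone`, and `read_asc` are hypothetical names, rows are reduced to (clustering key, write timestamp) pairs, and the per-sstable maximum timestamp stands in for the kind of statistics Hannu mentions. The seek past the tombstone range is taken only when no row in that sstable can be newer than the tombstone, which is the caveat raised earlier in the thread.

```python
import heapq
from bisect import bisect_left
from dataclasses import dataclass
from itertools import islice
from typing import Iterator, List, Optional, Tuple

Row = Tuple[int, int]  # (clustering key, write timestamp); hypothetical layout


@dataclass
class RangeTombstone:
    end: int        # shadows clustering keys < end, like `timeid < y`
    timestamp: int  # write time of the DELETE


@dataclass
class SSTable:
    rows: List[Row]  # sorted by clustering key
    tombstone: Optional[RangeTombstone] = None

    def max_timestamp(self) -> int:
        # Stand-in for per-sstable timestamp statistics (an assumption).
        return max((ts for _, ts in self.rows), default=-1)


def read_asc(sstables: List[SSTable], limit: int, skip: bool) -> Tuple[List[int], int]:
    """Return (live keys in ascending order, number of rows touched)."""
    # Steps 1-2: collect the range tombstones before touching any row.
    tombstones = [s.tombstone for s in sstables if s.tombstone is not None]
    touched = 0

    def scan(table: SSTable) -> Iterator[Row]:
        nonlocal touched
        start = 0
        if skip:
            keys = [key for key, _ in table.rows]
            for t in tombstones:
                # Step 3: seeking past the range is only safe when no row
                # in this sstable can be newer than the tombstone.
                if table.max_timestamp() < t.timestamp:
                    start = max(start, bisect_left(keys, t.end))
        for row in table.rows[start:]:
            touched += 1
            yield row

    def is_live(row: Row) -> bool:
        key, ts = row
        return not any(key < t.end and ts < t.timestamp for t in tombstones)

    # Step 4: merge the remaining rows in clustering order, drop shadowed ones.
    merged = heapq.merge(*(scan(s) for s in sstables))
    live = (row[0] for row in merged if is_live(row))
    return list(islice(live, limit)), touched


if __name__ == "__main__":
    old = SSTable(rows=[(key, 1) for key in range(1000)])
    flushed = SSTable(rows=[], tombstone=RangeTombstone(end=500, timestamp=2))
    for skip in (False, True):
        keys, touched = read_asc([old, flushed], limit=10, skip=skip)
        print(f"skip={skip}: first key {keys[0]}, rows touched {touched}")
```

In this model both variants return the same live rows; the difference is only in how many shadowed rows are touched on the way. If the older sstable holds even one row written after the DELETE, the timestamp check fails and the sketch falls back to the linear scan, which matches Hannu's point that skipping needs age information the index may not have.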