From: Paulo Ricardo Motta Gomes <paulo.motta@chaordicsystems.com>
Date: Fri, 11 Apr 2014 14:33:11 -0300
Subject: Re: clearing tombstones?
To: user@cassandra.apache.org

This thread is really informative, thanks for the good feedback.

My question is: is there a way to force tombstones to be cleared with LCS?
Does scrub help in any case? Or is the only solution to create a new CF and
migrate all the data if you intend to do a large CF cleanup?

Cheers,
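For context, a quick way to confirm which compaction strategy and
gc_grace_seconds a CF is actually running with on a 1.2-era cluster is to
read the schema tables directly. A minimal sketch, assuming the 1.2 layout
of system.schema_columnfamilies and using 'myks' as a placeholder keyspace
name:

    # List compaction strategy and tombstone GC window for every CF in a
    # keyspace ('myks' is a placeholder; cqlsh defaults to localhost).
    echo "
      SELECT columnfamily_name, compaction_strategy_class, gc_grace_seconds
      FROM system.schema_columnfamilies
      WHERE keyspace_name = 'myks';
    " | cqlsh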
On Fri, Apr 11, 2014 at 2:02 PM, Mark Reddy wrote:

> That's great, Will. If you could update the thread with the actions you
> decide to take and the results, that would be great.
>
> Mark
>
> On Fri, Apr 11, 2014 at 5:53 PM, William Oberman wrote:
>
>> I've learned a *lot* from this thread. My thanks to all of the
>> contributors!
>>
>> Paulo: Good luck with LCS. I wish I could help there, but all of my CFs
>> are SizeTiered (mostly as I'm on the same schema/same settings since
>> 0.7...)
>>
>> will
>>
>> On Fri, Apr 11, 2014 at 12:14 PM, Mina Naguib wrote:
>>
>>> Levelled Compaction is a wholly different beast when it comes to
>>> tombstones.
>>>
>>> The tombstones are inserted, like any other write really, at the lower
>>> levels in the leveldb hierarchy.
>>>
>>> They are only removed after they have had the chance to "naturally"
>>> migrate upwards in the leveldb hierarchy to the highest level in your
>>> data store. How long that takes depends on:
>>> 1. The amount of data in your store and the number of levels your LCS
>>> strategy has
>>> 2. The amount of new writes entering the bottom funnel of your leveldb,
>>> forcing upwards compaction and combining
>>>
>>> To give you an idea, I had a similar scenario and ran a (slow,
>>> throttled) delete job on my cluster around December-January. Here's a
>>> graph of the disk space usage on one node. Notice the still-declining
>>> usage long after the cleanup job finished (sometime in January). I tend
>>> to think of tombstones in LCS as little bombs that get to explode much
>>> later in time:
>>>
>>> http://mina.naguib.ca/images/tombstones-cassandra-LCS.jpg
>>>
>>> On 2014-04-11, at 11:20 AM, Paulo Ricardo Motta Gomes
>>> <paulo.motta@chaordicsystems.com> wrote:
>>>
>>> I have a similar problem here: I deleted about 30% of a very large CF
>>> using LCS (about 80GB per node), but my data still hasn't shrunk, even
>>> though I used 1 day for gc_grace_seconds. Would nodetool scrub help?
>>> Does nodetool scrub force a minor compaction?
>>>
>>> Cheers,
>>>
>>> Paulo
>>>
>>> On Fri, Apr 11, 2014 at 12:12 PM, Mark Reddy wrote:
>>>
>>>> Yes, running nodetool compact (major compaction) creates one large
>>>> SSTable. This will mess up the heuristics of the SizeTiered strategy
>>>> (is this the compaction strategy you are using?), leading to multiple
>>>> 'small' SSTables alongside the single large SSTable, which results in
>>>> increased read latency. You will incur the operational overhead of
>>>> having to manage compactions if you wish to compact these smaller
>>>> SSTables. For all these reasons it is generally advised to stay away
>>>> from running compactions manually.
>>>>
>>>> Assuming that this is a production environment and you want to keep
>>>> everything running as smoothly as possible, I would reduce the
>>>> gc_grace on the CF, allow automatic minor compactions to kick in, and
>>>> then increase the gc_grace once again after the tombstones have been
>>>> removed.
>>>>
>>>> On Fri, Apr 11, 2014 at 3:44 PM, William Oberman
>>>> <oberman@civicscience.com> wrote:
>>>>
>>>>> So, if I was impatient and just "wanted to make this happen now", I
>>>>> could:
>>>>>
>>>>> 1.) Change GCGraceSeconds of the CF to 0
>>>>> 2.) Run nodetool compact (*)
>>>>> 3.) Change GCGraceSeconds of the CF back to 10 days
>>>>>
>>>>> Since I have ~900M tombstones, even if I miss a few due to
>>>>> impatience, I don't care *that* much, as I could re-run my clean-up
>>>>> tool against the now much smaller CF.
>>>>>
>>>>> (*) A long, long time ago I seem to recall reading advice about
>>>>> "don't ever run nodetool compact", but I can't remember why. Is
>>>>> there any bad long-term consequence? Short term there are several:
>>>>> - a heavy operation
>>>>> - temporary 2x disk space
>>>>> - one big SSTable afterwards
>>>>> But moving forward, everything is OK, right?
>>>>> CommitLog/MemTable->SSTables, minor compactions that merge SSTables,
>>>>> etc... The only flaw I can think of is that it will take forever
>>>>> until the SSTable minor compactions build up enough to consider
>>>>> including the big SSTable in a compaction, making it likely I'll
>>>>> have to self-manage compactions.
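For reference, the three steps quoted above look roughly like this on a 1.2
cluster. 'myks' and 'mycf' are placeholder names, 864000 is simply 10 days
expressed in seconds, and the usual caveat applies: with gc_grace_seconds at
0, a replica that missed the deletes can later resurrect the deleted data,
so this only makes sense on a healthy, consistent cluster.

    # 1) Drop the tombstone GC window to zero ('myks'/'mycf' are placeholders).
    echo "ALTER TABLE myks.mycf WITH gc_grace_seconds = 0;" | cqlsh

    # 2) Major-compact the CF on each node in turn; it is a heavy operation
    #    and can need up to ~2x the CF's disk space while it runs.
    nodetool compact myks mycf

    # 3) Restore the default 10-day window (864000 seconds).
    echo "ALTER TABLE myks.mycf WITH gc_grace_seconds = 864000;" | cqlsh

Mark's gentler variant above is the same idea with a small but non-zero
gc_grace_seconds and no manual compact, letting the automatic minor
compactions do the work.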
>>>>> On Fri, Apr 11, 2014 at 10:31 AM, Mark Reddy wrote:
>>>>>
>>>>>> Correct, a tombstone will only be removed after the gc_grace period
>>>>>> has elapsed. The default value is set to 10 days, which allows a
>>>>>> great deal of time for consistency to be achieved prior to
>>>>>> deletion. If you are operationally confident that you can achieve
>>>>>> consistency via anti-entropy repairs within a shorter period, you
>>>>>> can always reduce that 10 day interval.
>>>>>>
>>>>>> Mark
>>>>>>
>>>>>> On Fri, Apr 11, 2014 at 3:16 PM, William Oberman
>>>>>> <oberman@civicscience.com> wrote:
>>>>>>
>>>>>>> I'm seeing a lot of articles about a dependency between removing
>>>>>>> tombstones and GCGraceSeconds, which might be my problem (I just
>>>>>>> checked, and this CF has GCGraceSeconds of 10 days).
>>>>>>>
>>>>>>> On Fri, Apr 11, 2014 at 10:10 AM, tommaso barbugli
>>>>>>> <tbarbugli@gmail.com> wrote:
>>>>>>>
>>>>>>>> Compaction should take care of it; for me it never worked, so I
>>>>>>>> run nodetool compact on every node; that does it.
>>>>>>>>
>>>>>>>> 2014-04-11 16:05 GMT+02:00 William Oberman
>>>>>>>> <oberman@civicscience.com>:
>>>>>>>>
>>>>>>>>> I'm wondering what will clear tombstoned rows? nodetool cleanup,
>>>>>>>>> nodetool repair, or time (as in just wait)?
>>>>>>>>>
>>>>>>>>> I had a CF that was more or less storing session information.
>>>>>>>>> After some time, we decided that one piece of this information
>>>>>>>>> was pointless to track (and was 90%+ of the columns, and in 99%
>>>>>>>>> of those cases was ALL columns for a row). I wrote a process to
>>>>>>>>> remove all of those columns (which again, in a vast majority of
>>>>>>>>> cases, had the effect of removing the whole row).
>>>>>>>>>
>>>>>>>>> This CF had ~1 billion rows, so I expect to be left with ~100M
>>>>>>>>> rows. After I did this mass delete, everything was the same size
>>>>>>>>> on disk (which I expected, knowing how tombstoning works). It
>>>>>>>>> wasn't 100% clear to me what to poke to cause compactions to
>>>>>>>>> clear the tombstones. First I tried nodetool cleanup on a
>>>>>>>>> candidate node, but afterwards the disk usage was the same. Then
>>>>>>>>> I tried nodetool repair on that same node, but again, disk usage
>>>>>>>>> is still the same. The CF has no snapshots.
>>>>>>>>>
>>>>>>>>> So, am I misunderstanding something? Is there another operation
>>>>>>>>> to try? Do I have to "just wait"? I've only done cleanup/repair
>>>>>>>>> on one node. Do I have to run one or the other over all nodes to
>>>>>>>>> clear tombstones?
>>>>>>>>>
>>>>>>>>> Cassandra 1.2.15 if it matters,
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> will
>>>
>>> --
>>> Paulo Motta
>>>
>>> Chaordic | Platform
>>> www.chaordic.com.br
>>> +55 48 3232.3200

--
Paulo Motta

Chaordic | Platform
www.chaordic.com.br
+55 48 3232.3200
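Whichever route is taken, one rough way to watch whether space is actually
being reclaimed is to sample cfstats and compactionstats across the ring.
A minimal sketch, assuming a 1.2 cluster and using placeholder host names
(node1..node3) and a placeholder CF name ('mycf'):

    # 1.2's cfstats prints per-CF blocks headed by "Column Family: <name>".
    # node1..node3 and 'mycf' are placeholders for your own hosts and CF.
    for h in node1 node2 node3; do
      echo "== $h =="
      nodetool -h "$h" cfstats | grep -A 12 'Column Family: mycf' \
        | grep -E 'Space used|SSTable count'
      nodetool -h "$h" compactionstats
    done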