Subject: Re: Cassandra tombstones being created by updating rows with TTL's
From: "Laing, Michael" <michael.laing@nytimes.com>
To: user@cassandra.apache.org
Date: Tue, 21 Apr 2015 12:09:13 -0400

Previous discussions on the list cover why this is not a problem in much more detail.

If something changes in your cluster - a node goes down, a new node is added, etc. - you certainly run repair. We also run periodic repairs prophylactically.

But if you never delete and always TTL by the same amount, you do not have to worry about zombie data being resurrected - the main reason for running repair within gc_grace_seconds.
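As a concrete sketch of that approach - the keyspace and table names below are placeholders, not taken from this thread - the change is a single table property:

    -- Placeholder names: substitute your own keyspace and table.
    -- Only safe when rows are never deleted explicitly and every write
    -- carries the same (or a monotonically increasing) TTL, so there is
    -- no explicitly deleted data that a lagging replica could resurrect.
    ALTER TABLE demo.readings WITH gc_grace_seconds = 0;

With that in place, expired cells can be purged by the next compaction that touches them instead of being carried as tombstones for the full grace period.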
On Tue, Apr 21, 2015 at 11:49 AM, Walsh, Stephen <Stephen.Walsh@aspect.com> wrote:

> Many thanks Michael,
>
> I will give these settings a go.
>
> How do you do your periodic nodetool repairs in this situation? From what
> I read we need to start doing this also.
>
> https://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair
>
> *From:* Laing, Michael [mailto:michael.laing@nytimes.com]
> *Sent:* 21 April 2015 16:26
> *To:* user@cassandra.apache.org
> *Subject:* Re: Cassandra tombstones being created by updating rows with TTL's
>
> If you never delete except by TTL, and always write with the same TTL (or
> monotonically increasing), you can set gc_grace_seconds to 0.
>
> That's what we do. There have been discussions on the list over the last
> few years re this topic.
>
> ml
>
> On Tue, Apr 21, 2015 at 11:14 AM, Walsh, Stephen <Stephen.Walsh@aspect.com> wrote:
>
> We were chatting to Jon Haddena about a week ago about our tombstone
> issue using Cassandra 2.0.14.
>
> To summarize:
>
> We have a 3 node cluster with replication-factor=3 and compaction = SizeTiered.
> We use 1 keyspace with 1 table.
> Each row has about 40 columns.
> Each row has a TTL of 10 seconds.
>
> We insert about 500 rows per second in a prepared batch** (about 3 MB in
> network overhead).
> We query the entire table once per second.
>
> **This is to enable consistent data, i.e. the batch is transactional, so
> we get all queried data from one insert and not a mix of 2 or more.
>
> It seems that although we insert every second, the rows are never deleted
> by the TTL - or so we thought.
>
> After some time we got this message on the query side:
>
> #######################################
> ERROR [ReadStage:91] 2015-04-21 12:27:03,902 SliceQueryFilter.java (line
> 206) Scanned over 100000 tombstones in keyspace.table; query aborted (see
> tombstone_failure_threshold)
> ERROR [ReadStage:91] 2015-04-21 12:27:03,931 CassandraDaemon.java (line
> 199) Exception in thread Thread[ReadStage:91,5,main]
> java.lang.RuntimeException:
> org.apache.cassandra.db.filter.TombstoneOverwhelmingException
>         at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2008)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.cassandra.db.filter.TombstoneOverwhelmingException
> #######################################
>
> So we know tombstones are in fact being created.
>
> Our solution was to change the table schema and set gc_grace_seconds to
> 60 seconds. This worked for 20 seconds, then we saw this:
>
> #######################################
> Read 500 live and 30000 tombstoned cells in keyspace.table (see
> tombstone_warn_threshold). 10000 columns was requested, slices=[-],
> delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647}
> #######################################
>
> So every 20 seconds (500 inserts x 20 seconds = 10,000 tombstones).
>
> So now we have gc_grace_seconds set to 10 seconds. But it feels very
> wrong to have it at such a low number, especially if we move to a larger
> cluster. This just won't fly.
>
> What are we doing wrong?
>
> We shouldn't increase the tombstone threshold as that is extremely dangerous.
>
> Best Regards
>
> Stephen Walsh
>
> This email (including any attachments) is proprietary to Aspect Software,
> Inc. and may contain information that is confidential. If you have received
> this message in error, please do not read, copy or forward this message.
> Please notify the sender immediately, delete it from your system and
> destroy any copies. You may not further disclose or distribute this email
> or its attachments.
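For reference, the write pattern Stephen describes above - a transactional batch of rows, each written with a 10-second TTL - would look roughly like the following CQL. The schema is hypothetical, since the thread does not include the real one (which has around 40 columns per row):

    -- Hypothetical table standing in for the real ~40-column one.
    CREATE TABLE demo.readings (
        id   int PRIMARY KEY,
        col1 text,
        col2 text
    );

    -- The batch is applied as a unit so that, per the thread, a query sees
    -- rows from one insert and not a mix of two or more; every row carries
    -- the same 10-second TTL.
    BEGIN BATCH
        INSERT INTO demo.readings (id, col1, col2) VALUES (1, 'a', 'x') USING TTL 10;
        INSERT INTO demo.readings (id, col1, col2) VALUES (2, 'b', 'y') USING TTL 10;
    APPLY BATCH;

Note that each cell still becomes a tombstone when its TTL expires and is retained until compaction purges it after gc_grace_seconds, which is why the once-per-second full-table query scans an ever-growing pile of tombstones when the grace period is long.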