From: Alain RODRIGUEZ
Date: Thu, 11 May 2017 22:43:58 +0100
To: Bohdan Tantsiura
Cc: user@cassandra.apache.org
Subject: Re: Drop tables takes too long

Hi,

> We were trying to overcome OOM crashes.

Fair enough :-).

> We changed settings to default on one node. GC times became about two
> times smaller on that node.

That's encouraging! It looks like even if the number of tables is really high, there is still room for optimization. Have you made the change on the entire cluster by now? How are things going?

You can also continue to play around with a canary node to find the sweet spot, the right tuning for your use case and hardware. Taking a day to do this is sometimes well worth it ;-).

> Number of sstables is not constant. During about 2.5 hours the number of
> sstables changed for 26 tables, e.g. [4, 4, 4, 4, 4, 4, 4, 4, 4, 4] => [6, 6, 6, 6, 6, 6, 6, 6, 6, 6]
> or [4, 7, 7, 7, 5, 5, 4, 4, 4, 4] => [4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
> (each list is the number of sstables on each of the 10 nodes for one table).
> Number of sstables is balanced for almost all tables. But for some tables
> the number of sstables is not really balanced, like [11, 2, 4, 4, 4, 2, 2, 2, 2, 5]
> or [439, 558, 346, 521, 490, 553, 500, 515, 522, 495]

Those sound like reasonable numbers of SSTables, and the imbalances are not that big. I would say compaction is running normally.

> We run incremental repairs.
>
> We use LCS for all tables and MVs. We don't do manual compactions (or
> trigger any anti-compactions).

If you run incremental repairs, you do trigger anti-compactions. Actually, the first incremental repair may well have produced the increase in the number of SSTables. If it was not the first one, you might have run into one of the corner cases that are still not fixed in incremental repairs.

Also, using LCS for such a high number of tables is probably putting some pressure on the disks. Is it a real need? Would STCS or TWCS not be a better fit for some of the tables? That being said, compactions look good. Are you seeing pending compactions under standard conditions or only at peak hours?

> > What number of MemtableFlushWriter are you using?
>
> We do not specify it, so the default is used:
> https://github.com/apache/cassandra/blob/cassandra-3.10/conf/cassandra.yaml#L538

So it is 2. If disk IO is not a bottleneck, you might want to consider increasing this number to 4; it should be a safe value. To know if flush writers are an issue, use 'watch -d nodetool tpstats' and see if there are any 'FlushWriter' pending threads.

> One column for each node
> READ: 2609 7 0 0 2 1 2 0 1 1

The only non-negligible value is for dropped reads, and only on one node. If this value is not growing anymore, you might have faced a one-off issue.

This cluster looks relatively healthy, except for the GC activity (which can explain the dropped reads). I would persevere with the GC tuning, and keep monitoring the things you have been sharing with us so far to see how it evolves. Having a closer look at the impact of repairs might be worth it as well. Good luck!
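(For reference, a quick way to check both points from a shell on one node could look like the sketch below. It assumes nodetool is on the PATH, that cassandra.yaml lives in /etc/cassandra, and <keyspace>.<table> is a placeholder; adjust for your install.)

# Current memtable_flush_writers setting (commented out = the default, i.e. 2 on 3.10)
$ grep -n 'memtable_flush_writers' /etc/cassandra/cassandra.yaml

# Watch for flush / compaction threads piling up in the pending column
$ watch -d "nodetool tpstats | grep -E 'MemtableFlushWriter|CompactionExecutor'"

# Compaction backlog and per-table SSTable counts
$ nodetool compactionstats
$ nodetool tablestats <keyspace>.<table> | grep -i 'sstable count'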
C*heers,
-----------------------
Alain Rodriguez - @arodream - alain@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2017-05-08 14:21 GMT+01:00 Bohdan Tantsiura:

> Hi,
>
> > Why did you move from defaults that much?
> We were trying to overcome OOM crashes.
>
> > Would you consider giving defaults a try on a canary node and monitor /
> > compare GC times to other nodes?
> We changed settings to default on one node. GC times became about two
> times smaller on that node.
>
> > What do you mean from time to time? For how long are these tasks pending,
> > and how frequently is this happening?
> CompactionExecutor pending tasks appeared on nodes once every 3 or more
> hours. Tasks were pending for about 5-15 minutes.
>
> > Is the number of sstables constant and balanced between nodes?
> Number of sstables is not constant. During about 2.5 hours the number of
> sstables changed for 26 tables, e.g. [4, 4, 4, 4, 4, 4, 4, 4, 4, 4] =>
> [6, 6, 6, 6, 6, 6, 6, 6, 6, 6] or [4, 7, 7, 7, 5, 5, 4, 4, 4, 4] =>
> [4, 4, 4, 4, 4, 4, 4, 4, 4, 4] (each list is the number of sstables on
> each of the 10 nodes for one table).
> Number of sstables is balanced for almost all tables. But for some tables
> the number of sstables is not really balanced, like [11, 2, 4, 4, 4, 2, 2, 2, 2, 5]
> or [439, 558, 346, 521, 490, 553, 500, 515, 522, 495]
>
> > Also do you run full or incremental repairs?
> We run incremental repairs.
>
> > Do you use LCS or do some manual compactions (or trigger any
> > anti-compactions)?
> We use LCS for all tables and MVs. We don't do manual compactions (or
> trigger any anti-compactions).
>
> > How is CPU doing, is there any burst in CPU that could be related to
> > these errors?
> Unfortunately, the stats for the period when there were InternalResponseStage
> pending tasks are lost.
>
> > What number of MemtableFlushWriter are you using?
> We do not specify it, so the default is used.
>
> > What tasks were dropped?
> One column for each node:
> READ:             2609 7 0 0 2 1 2 0 1 1
> HINT:                0 0 0 0 1 0 1 0 0 0
> MUTATION:            0 0 1 0 0 0 0 0 0 0
> REQUEST_RESPONSE:    0 0 0 0 2 0 0 0 0 1
>
> Thanks
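(Side note: per-node dropped-message counters like the ones above can be pulled from every node with a small loop like the sketch below; the host names and SSH access to each node are assumptions.)

# Print the "Message type ... Dropped" section of tpstats for each node
$ for host in node1 node2 node3; do
    echo "== $host =="
    ssh "$host" "nodetool tpstats | awk '/Message type/,0'"
  done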
> 2017-05-03 17:00 GMT+03:00 Alain RODRIGUEZ:
>
>> Hi,
>>
>> A few comments:
>>
>>> Long GC Pauses take about one minute
>>
>> This is huge. About the JVM config: I haven't played much with G1GC, but the
>> following seems to be a bad idea according to the comments:
>>
>> ## Main G1GC tunable: lowering the pause target will lower throughput and
>> ## vise versa.
>> ## 200ms is the JVM default and lowest viable setting
>> ## 1000ms increases throughput. Keep it smaller than the timeouts in
>> ## cassandra.yaml.
>> -XX:MaxGCPauseMillis=15000
>>
>> # Save CPU time on large (>= 16GB) heaps by delaying region scanning
>> # until the heap is 70% full. The default in Hotspot 8u40 is 40%.
>> -XX:InitiatingHeapOccupancyPercent=30
>> # For systems with > 8 cores, the default ParallelGCThreads is 5/8 the
>> # number of logical cores.
>> # Otherwise equal to the number of cores when 8 or less.
>> # Machines with > 10 cores should try setting these to <= full cores.
>> -XX:ParallelGCThreads=8
>> # By default, ConcGCThreads is 1/4 of ParallelGCThreads.
>> # Setting both to the same value can reduce STW durations.
>> -XX:ConcGCThreads=8
>>
>> Why did you move from defaults that much? Would you consider giving
>> defaults a try on a canary node and monitoring / comparing GC times to
>> the other nodes?
>>
>>> 1) ColumnFamilyStore.java:542 - Failed unregistering mbean:
>>> org.apache.cassandra.db:type=Tables,keyspace=...,table=...
>>> from MigrationStage thread
>>
>> I am not sure about this one... :p
>>
>>> 2) Read 1715 live rows and 1505 tombstone cells for query ...
>>> from ReadStage thread
>>
>> Half of what was read for this query was deleted data, with obvious disk
>> space, disk throughput and latency consequences. This is an entire topic...
>> Here is what I know about it:
>> thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html,
>> I hope it will help you solve your issue.
>>
>>> 3) GCInspector.java:282 - G1 Young Generation GC in 1725ms
>>
>> This might be related to your GC configuration or to some of the other issues
>> mentioned in your last mail.
>>
>>> About 3000-6000 CompactionExecutor pending tasks appeared on all nodes
>>> from time to time.
>>
>> Hum, that's weird. It's huge, but as you have so many tables I am not sure;
>> it might be a 'normal' issue when running with so many tables and MVs.
>>
>> What do you mean by from time to time? For how long are these tasks pending,
>> and how frequently is this happening?
>>
>> Is the number of sstables constant and balanced between nodes?
>> Also, do you run full or incremental repairs?
>> Do you use LCS or do some manual compactions (or trigger any
>> anti-compactions)?
>>
>>> About 1000 MigrationStage pending tasks appeared on 2 nodes.
>>
>> Those are pending writes, meaning this Cassandra node can't cope with what
>> is thrown at it. It can be related to pending flushes (blocking writes),
>> huge Garbage Collections (Stop The World, including writes), hardware
>> limits (CPU busy with compactions?) or even a too conservative setting of
>> concurrent_writes.
>>
>>> About 700 InternalResponseStage pending tasks appeared on 2 nodes.
>>
>> I never had issues with this one, so I didn't know much about it. But
>> according to Chris Lohfink in this post,
>> https://www.pythian.com/blog/guide-to-cassandra-thread-pools/#InternalResponseStage,
>> this thread pool is responsible for "Responding to non-client initiated
>> messages, including bootstrapping and schema checking", which again might be
>> related to the huge number of tables in the cluster. How is the CPU doing, is
>> there any burst in CPU that could be related to these errors?
>>
>>> About 60 MemtableFlushWriter appeared on 3 nodes.
>>
>> What number of MemtableFlushWriter are you using? Consider increasing it
>> (or maybe the memtable size).
>>
>>> There were no blocked tasks, but there were "All time blocked" tasks
>>> (they were there before we started dropping tables), from 3 million to
>>> 20 million on different nodes.
>>
>> What tasks were dropped?
>>
>> The cluster doesn't look completely healthy, but I believe it is possible
>> to improve things before thinking about splitting the tables into multiple
>> clusters. I would definitely not add more tables though...
>>
>> C*heers,
>> -----------------------
>> Alain Rodriguez - @arodream - alain@thelastpickle.com
>> France
>>
>> The Last Pickle - Apache Cassandra Consulting
>> http://www.thelastpickle.com
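(To put numbers on points 2 and 3 from that mail over time, the warnings can be pulled straight out of system.log; a rough sketch, assuming the default log location.)

# GC pauses reported by GCInspector, with their durations
$ grep 'GCInspector' /var/log/cassandra/system.log | grep -o 'in [0-9]*ms'

# Tombstone-heavy reads logged by the read path
$ grep 'tombstone cells' /var/log/cassandra/system.log | tail -20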
>> 2017-04-28 14:35 GMT+01:00 Bohdan Tantsiura:
>>
>>> Thanks Alain,
>>>
>>> > Or is it only happening during drop table actions?
>>> Some other schema changes (e.g. adding columns to tables) also take too
>>> much time.
>>>
>>> Link to the complete set of GC options: https://pastebin.com/4qyENeyu
>>>
>>> > Have you had a look at logs, mainly errors and warnings?
>>> In the logs I found warnings of 3 types:
>>> 1) ColumnFamilyStore.java:542 - Failed unregistering mbean:
>>> org.apache.cassandra.db:type=Tables,keyspace=...,table=...
>>> from MigrationStage thread
>>> 2) Read 1715 live rows and 1505 tombstone cells for query ...
>>> from ReadStage thread
>>> 3) GCInspector.java:282 - G1 Young Generation GC in 1725ms. G1 Eden
>>> Space: 38017171456 -> 0; G1 Survivor Space: 2516582400 -> 2650800128;
>>> from Service Thread
>>>
>>> > Are there some pending, blocked or dropped tasks in the thread pool stats?
>>> About 3000-6000 CompactionExecutor pending tasks appeared on all nodes
>>> from time to time. About 1000 MigrationStage pending tasks appeared on 2
>>> nodes. About 700 InternalResponseStage pending tasks appeared on 2 nodes.
>>> About 60 MemtableFlushWriter appeared on 3 nodes.
>>> There were no blocked tasks, but there were "All time blocked" tasks
>>> (they were there before we started dropping tables), from 3 million to
>>> 20 million on different nodes.
>>>
>>> > Are some resources constrained (CPU / disk IO,...)?
>>> CPU and disk IO are not constrained.
>>>
>>> Thanks
>>>
>>> 2017-04-27 11:10 GMT+03:00 Alain RODRIGUEZ:
>>>
>>>> Hi,
>>>>
>>>>> Long GC Pauses take about one minute. But why does it take so much time
>>>>> and how can that be fixed?
>>>>
>>>> This is very long. It looks like you are having a major issue, and it is
>>>> not just about dropping tables... Or is it only happening during drop table
>>>> actions? Knowing the complete set of GC options in use could help here;
>>>> could you paste it here (or link to it)?
>>>>
>>>> Also, GC is often high as a consequence of other issues and not only
>>>> when 'badly' tuned.
>>>>
>>>> - Have you had a look at the logs, mainly errors and warnings?
>>>>
>>>>   $ grep -e "ERROR" -e "WARN" /var/log/cassandra/system.log
>>>>
>>>> - Are there some pending, blocked or dropped tasks in the thread pool
>>>>   stats?
>>>>
>>>>   $ watch -d nodetool tpstats
>>>>
>>>> - Are some resources constrained (CPU / disk IO,...)?
>>>>
>>>>> We have about 60 keyspaces with about 80 tables in each keyspace
>>>>>
>>>>> In each keyspace we also have 11 MVs
>>>>
>>>> Even if I believe we can dig into it and maybe improve things, I agree with
>>>> Carlos: this is a lot of tables (4800) and an even higher number of MVs
>>>> (660). It might be interesting to split it somehow if possible.
>>>>
>>>>> Cannot achieve consistency level ALL
>>>>
>>>> Finally, you could try to adjust the corresponding request timeout (I'm not
>>>> sure if it is the global one or the truncate timeout), so it may succeed
>>>> even when nodes are having minute-long GCs. But it is a workaround, as such
>>>> a one-minute GC will most definitely be an issue for the client queries
>>>> running (the default is a 10 second timeout, so many queries are probably
>>>> failing).
>>>>
>>>> C*heers,
>>>> -----------------------
>>>> Alain Rodriguez - @arodream - alain@thelastpickle.com
>>>> France
>>>>
>>>> The Last Pickle - Apache Cassandra Consulting
>>>> http://www.thelastpickle.com
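(If the timeout-adjustment workaround mentioned in that mail is attempted, the relevant knobs live in cassandra.yaml; a sketch only, the path and the exact timeouts worth raising should be checked against your version, and changes only take effect after a rolling restart.)

# Current request timeouts, in milliseconds
$ grep -E '^(request_timeout_in_ms|truncate_request_timeout_in_ms|write_request_timeout_in_ms):' \
      /etc/cassandra/cassandra.yaml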
>>>> 2017-04-25 13:58 GMT+02:00 Bohdan Tantsiura:
>>>>
>>>>> Thanks Zhao Yang,
>>>>>
>>>>> > Could you try some jvm tool to find out which threads are allocating
>>>>> > memory or gc'ing? Maybe the migration stage thread..
>>>>>
>>>>> I use Cassandra Cluster Manager to locally reproduce the issue. I
>>>>> tried to use VisualVM to find out which threads are allocating
>>>>> memory, but VisualVM does not see the cassandra processes and says
>>>>> "Cannot open application with pid". Then I tried YourKit Java
>>>>> Profiler. It created a snapshot when the process of one cassandra node failed.
>>>>> http://i.imgur.com/9jBcjcl.png - how CPU is used by threads.
>>>>> http://i.imgur.com/ox5Sozy.png - how memory is used by threads, but the
>>>>> biggest part of memory is used by objects without allocation information.
>>>>> http://i.imgur.com/oqx9crX.png - which objects use the biggest part of
>>>>> memory. Maybe you know some other good jvm tool that can show which
>>>>> threads use the biggest part of memory?
>>>>>
>>>>> > BTW, is your cluster under high load while dropping tables?
>>>>>
>>>>> LA5 was <= 5 on all nodes almost all the time while dropping tables.
>>>>>
>>>>> Thanks
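(On the "which threads are allocating / GCing" question, the stock JDK tools already give a first answer without attaching a profiler; a sketch, assuming a JDK on the node and <pid> being the Cassandra process id.)

# Per-thread CPU of the Cassandra JVM; busy thread ids can be matched
# (after converting to hex) to the nid=0x... values in a thread dump
$ top -H -p <pid>

# Thread dump, to map busy native thread ids to Cassandra pool names
$ jcmd <pid> Thread.print > threads.txt

# GC behaviour over time: generation occupancy and GC counts every second
$ jstat -gcutil <pid> 1000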
>>>>> 2017-04-21 19:49 GMT+03:00 Jasonstack Zhao Yang <zhaoyangsingapore@gmail.com>:
>>>>>
>>>>>> Hi Bohdan, Carlos,
>>>>>>
>>>>>> Could you try some jvm tool to find out which threads are allocating
>>>>>> memory or gc'ing? Maybe the migration stage thread..
>>>>>>
>>>>>> BTW, is your cluster under high load while dropping tables?
>>>>>>
>>>>>> As far as I remember, in older C* versions it applies the schema
>>>>>> mutation in memory, i.e. DROP, then flushes all schema info into sstables,
>>>>>> then reads all the on-disk schema back into memory (5k tables' info +
>>>>>> related column info)..
>>>>>>
>>>>>> > You also might need to increase the node count if you're resource
>>>>>> > constrained.
>>>>>>
>>>>>> More nodes won't help and would most probably make it worse due to
>>>>>> coordination.
>>>>>>
>>>>>> Zhao Yang
>>>>>>
>>>>>> On Fri, 21 Apr 2017 at 21:10 Bohdan Tantsiura wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> The problem is still not solved. Does anybody have any idea what to do
>>>>>>> with it?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> 2017-04-20 15:05 GMT+03:00 Bohdan Tantsiura:
>>>>>>>
>>>>>>>> Thanks Carlos,
>>>>>>>>
>>>>>>>> In each keyspace we also have 11 MVs.
>>>>>>>>
>>>>>>>> It is impossible to reduce the number of tables now. Long GC pauses
>>>>>>>> take about one minute. But why does it take so much time and how can
>>>>>>>> that be fixed?
>>>>>>>>
>>>>>>>> Each node in the cluster has 128GB RAM, so resources are not
>>>>>>>> constrained now.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> 2017-04-20 13:18 GMT+03:00 Carlos Rolo:
>>>>>>>>
>>>>>>>>> You have 4800 tables in total? That is a lot of tables, plus MVs?
>>>>>>>>> Or are the MVs already considered in the 60*80 count?
>>>>>>>>>
>>>>>>>>> I would recommend reducing the table number. The other thing is that
>>>>>>>>> you need to check your log file for GC pauses, and how long those
>>>>>>>>> pauses take.
>>>>>>>>>
>>>>>>>>> You also might need to increase the node count if you're resource
>>>>>>>>> constrained.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> Carlos Juzarte Rolo
>>>>>>>>> Cassandra Consultant / Datastax Certified Architect / Cassandra MVP
>>>>>>>>>
>>>>>>>>> Pythian - Love your data
>>>>>>>>>
>>>>>>>>> rolo@pythian | Twitter: @cjrolo | Skype: cjr2k3 | Linkedin:
>>>>>>>>> linkedin.com/in/carlosjuzarterolo
>>>>>>>>> Mobile: +351 918 918 100
>>>>>>>>> www.pythian.com
>>>>>>>>>
>>>>>>>>> On Thu, Apr 20, 2017 at 11:10 AM, Bohdan Tantsiura <bohdantan@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> We are using cassandra 3.10 in a 10 node cluster with
>>>>>>>>>> replication = 3. MAX_HEAP_SIZE=64GB on all nodes, and G1 GC is used. We have
>>>>>>>>>> about 60 keyspaces with about 80 tables in each keyspace. We had to delete
>>>>>>>>>> three tables and two materialized views from each keyspace. It began to
>>>>>>>>>> take more and more time for each next keyspace (for some keyspaces it took
>>>>>>>>>> about 30 minutes) and then failed with "Cannot achieve consistency level
>>>>>>>>>> ALL". After restarting, the same repeated. It seems that cassandra hangs
>>>>>>>>>> on GC. How can that be solved?
>>>>>>>>>>
>>>>>>>>>> Thanks
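(For what it's worth, when dropping that many tables and views, pacing the drops and checking schema agreement between keyspaces tends to reduce the chance of hitting "Cannot achieve consistency level ALL"; a rough sketch, where the keyspace, view and table names and the cqlsh host are placeholders.)

# Drop the materialized views before their base tables, one keyspace at a time,
# and check that "Schema versions" shows a single version before moving on
$ for ks in ks1 ks2 ks3; do
    cqlsh cassandra-host -e "DROP MATERIALIZED VIEW IF EXISTS $ks.mv1; DROP TABLE IF EXISTS $ks.table1;"
    nodetool describecluster | grep -A 5 'Schema versions'
    sleep 10
  done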