cassandra-commits mailing list archives

From "Marcus Eriksson (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (CASSANDRA-9033) Upgrading from 2.1.1 to 2.1.3 with LCS and many sstable files makes nodes unresponsive
Date Thu, 26 Mar 2015 17:30:55 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-9033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcus Eriksson resolved CASSANDRA-9033.
----------------------------------------
    Resolution: Duplicate

> Upgrading from 2.1.1 to 2.1.3 with LCS and many sstable files makes nodes unresponsive
> ---------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-9033
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9033
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: * Ubuntu 14.04.2 - Linux ip-10-0-2-122 3.13.0-46-generic #79-Ubuntu SMP Tue Mar 10 20:06:50 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
> * EC2 m2-xlarge instances [4cpu, 16GB RAM, 1TB storage on 3 platters]
> * 12 nodes running a mix of 2.1.1 and 2.1.3
> * 8GB heap with offheap objects
>            Reporter: Brent Haines
>            Assignee: Marcus Eriksson
>         Attachments: cassandra-env.sh, cassandra.yaml, system.log.1.zip
>
>
> We have an Event Log table using LCS that has grown fast; there are more than 100K sstable files of around 1KB each. Increasing the number of concurrent compactors and raising the compaction throughput cap doesn't make a difference.
> It had been running fine until we upgraded to 2.1.3. Those nodes needed a larger heap (12 GB) to even have a prayer of responding to queries; they bog down and become unresponsive. There are no GC messages that I can see, and no compaction either.
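> For reference, these are the knobs I mean; the values below are illustrative only, and concurrent_compactors lives in cassandra.yaml (restart required):
> {code}
> # raise (or remove) the compaction throughput cap at runtime; 0 disables throttling
> nodetool setcompactionthroughput 0
>
> # check whether compactions are actually running or piling up
> nodetool compactionstats
>
> # in cassandra.yaml (restart required), illustrative value:
> #   concurrent_compactors: 4
> {code}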
> The only work-around I have found is to decommission, blow away the big CF, and rejoin. That takes about 20 minutes and everything is happy again; the sstable sizes afterwards are more like what I'd expect as well.
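> Roughly, the work-around looks like this; the service name and data paths are illustrative (default data directory assumed):
> {code}
> # on the affected node
> nodetool decommission
> sudo service cassandra stop
>
> # remove the system keyspaces and the oversized CF (paths illustrative)
> rm -rf /var/lib/cassandra/data/system /var/lib/cassandra/data/system_traces
> rm -rf /var/lib/cassandra/data/OpsCenter /var/lib/cassandra/data/data/stories-*
>
> # rejoin the ring; the node then read-repairs its data back
> sudo service cassandra start
> {code}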
> Our schema: 
> {code}
> cqlsh> describe columnfamily data.stories
> CREATE TABLE data.stories (
>     id timeuuid PRIMARY KEY,
>     action_data timeuuid,
>     action_name text,
>     app_id timeuuid,
>     app_instance_id timeuuid,
>     data map<text, text>,
>     objects set<timeuuid>,
>     time_stamp timestamp,
>     user_id timeuuid
> ) WITH bloom_filter_fp_chance = 0.01
>     AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
>     AND comment = 'Stories represent the timeline and are placed in the dashboard for the brand manager to see'
>     AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32'}
>     AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND dclocal_read_repair_chance = 0.1
>     AND default_time_to_live = 0
>     AND gc_grace_seconds = 864000
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.0
>     AND speculative_retry = '99.0PERCENTILE';
> cqlsh> 
> {code}
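> For reference, the dump above shows SizeTieredCompactionStrategy; switching the table to LCS would look something like this (the sstable_size_in_mb value is illustrative):
> {code}
> cqlsh -e "ALTER TABLE data.stories WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': '160'};"
> {code}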
> There were no log entries that stood out; they pretty much consisted of "x is down" / "x is up" repeated ad infinitum. I have attached the zipped system.log, which covers the situation after the upgrade and then after I stopped the node, removed system, system_traces, OpsCenter, and data/stories-/*, and restarted.
> It has rejoined the cluster now and is busy read-repairing to recover its data.
> On another note, we see a lot of this during repair now (on all the nodes): 
> {code}
> ERROR [AntiEntropySessions:5] 2015-03-24 20:03:10,207 RepairSession.java:303 - [repair #c5043c40-d260-11e4-a2f2-8bb3e2bbdb35] session completed with the following error
> java.io.IOException: Failed during snapshot creation.
>         at org.apache.cassandra.repair.RepairSession.failedSnapshot(RepairSession.java:344) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at org.apache.cassandra.repair.RepairJob$2.onFailure(RepairJob.java:146) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.jar:na]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
>         at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> ERROR [AntiEntropySessions:5] 2015-03-24 20:03:10,208 CassandraDaemon.java:167 - Exception in thread Thread[AntiEntropySessions:5,5,RMI Runtime]
> java.lang.RuntimeException: java.io.IOException: Failed during snapshot creation.
>         at com.google.common.base.Throwables.propagate(Throwables.java:160) ~[guava-16.0.jar:na]
>         at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_55]
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_55]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_55]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
>         at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> Caused by: java.io.IOException: Failed during snapshot creation.
>         at org.apache.cassandra.repair.RepairSession.failedSnapshot(RepairSession.java:344) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at org.apache.cassandra.repair.RepairJob$2.onFailure(RepairJob.java:146) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.jar:na]
>         ... 3 common frames omitted
> ERROR [RepairJobTask:2] 2015-03-24 20:03:20,227 RepairJob.java:145 - Error occurred during snapshot phase
> java.lang.RuntimeException: Could not create snapshot at /10.0.2.144
>         at org.apache.cassandra.repair.SnapshotTask$SnapshotCallback.onFailure(SnapshotTask.java:77) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at org.apache.cassandra.net.MessagingService$5$1.run(MessagingService.java:349) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_55]
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_55]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
>         at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> {code} 
> I am thinking this means my work-around of blowing away and rebuilding the CF may not be working anymore. I don't know of another way to force LCS compaction, and the node never seems to recover enough to compact on its own.
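> The only manual levers I know of are the ones below, and as far as I know nodetool compact does not perform a true major compaction for LCS tables on 2.1:
> {code}
> # see what compaction thinks it has pending
> nodetool compactionstats
>
> # request a manual compaction of the table (keyspace "data", table "stories")
> nodetool compact data stories
> {code}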



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
