cassandra-commits mailing list archives

From "Paulo Motta (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-10797) Bootstrap new node fails with OOM when streaming nodes contains thousands of sstables
Date Fri, 11 Dec 2015 12:47:11 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15052692#comment-15052692 ]

Paulo Motta edited comment on CASSANDRA-10797 at 12/11/15 12:46 PM:
--------------------------------------------------------------------

As mentioned before, I was able to reproduce the OOM with 1000 small sstables and a 50M heap.
I attached a [ccm cluster|https://issues.apache.org/jira/secure/attachment/12777032/dtest.tar.gz]
with 2 nodes. To reproduce, extract {{dtest.tar.gz}} into the {{~/.ccm}} folder, then
update the following properties in {{dtest/node*/conf/cassandra.yaml}} to match your local directories:
{{commitlog_directory}}, {{data_file_directories}} and {{saved_caches_directory}}. After that,
run the following commands:
{noformat}
ccm switch dtest
ccm node1 start
sleep 10
ccm node2 start  # will throw OOM
{noformat}
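
The directory update described above can be scripted. The following is only a rough sketch: the function name {{rewrite_dirs}} and the subfolder names ({{commitlogs}}, {{data}}, {{saved_caches}}) are assumptions made for illustration, and it treats the YAML as plain text rather than parsing it, so check the result before starting the nodes.

```python
def rewrite_dirs(yaml_text, base_dir):
    """Point the three directory settings at subfolders of base_dir.

    Hypothetical helper: works line by line on the text of cassandra.yaml
    and assumes data_file_directories is followed by "- <path>" entries.
    """
    scalars = {
        "commitlog_directory": base_dir + "/commitlogs",
        "saved_caches_directory": base_dir + "/saved_caches",
    }
    out = []
    skipping_list = False
    for line in yaml_text.splitlines():
        if skipping_list and line.lstrip().startswith("- "):
            continue  # drop the old data directory entries
        skipping_list = False
        key = line.split(":", 1)[0]
        if key in scalars:
            out.append(f"{key}: {scalars[key]}")
        elif key == "data_file_directories":
            out.append("data_file_directories:")
            out.append(f"- {base_dir}/data")
            skipping_list = True
        else:
            out.append(line)  # every other setting is left untouched
    return "\n".join(out) + "\n"


sample = (
    "cluster_name: dtest\n"
    "commitlog_directory: /old/commitlogs\n"
    "data_file_directories:\n"
    "- /old/data\n"
    "saved_caches_directory: /old/saved_caches\n"
)
result = rewrite_dirs(sample, "/home/me/.ccm/dtest/node1")
```

In practice the function would be applied to the contents of each {{dtest/node*/conf/cassandra.yaml}} file, with {{base_dir}} set to that node's directory.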

The main problem is that all {{SSTableWriter}} instances remain open until the end of the stream
receive task, and these objects are quite large: they hold indexes and stats that are written to
disk only when the {{SSTableWriter}} is closed.

Before CASSANDRA-6503, each {{SSTableWriter}} was closed as soon as its sstable was received, and the
stream receive task kept only the {{SSTableReader}} instances, which have a much smaller memory footprint.
The main reason to defer closing the {{SSTableWriter}} to the end of the stream receive
task was to keep sstables temporary (with the {{-tmp}} infix), preventing stale sstables from reappearing
if the machines are restarted after a failed repair session. One alternative discussed was to
close the {{SSTableWriter}} without removing the {{-tmp}} infix, and to perform an atomic
rename at the end of the stream task. However, this alternative was rejected because the {{SSTableReader}}
would need to be closed and reopened in order to perform the atomic rename on non-POSIX systems
such as Windows.
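
To make the difference between the two behaviors concrete, here is a toy model of peak memory while receiving sstables. The cost constants are invented for illustration and are not measurements of {{SSTableWriter}} or {{SSTableReader}}; only the shape of the result matters.

```python
# Toy model only: the sizes below are made-up numbers, not measurements.
WRITER_COST = 100  # hypothetical in-memory units held by an open writer
READER_COST = 5    # hypothetical units held by a reader of a closed sstable


def peak_memory(num_sstables, close_on_receive):
    """Return the peak number of units held while receiving num_sstables."""
    held = 0
    peak = 0
    for _ in range(num_sstables):
        held += WRITER_COST  # a writer is opened for the incoming file
        peak = max(peak, held)
        if close_on_receive:
            # Closing flushes index and stats; only a reader stays in memory.
            held += READER_COST - WRITER_COST
    return peak


deferred = peak_memory(1000, close_on_receive=False)  # writers kept open
eager = peak_memory(1000, close_on_receive=True)      # writers closed early
```

With these assumed costs, the deferred variant peaks at 100000 units and the eager variant at 5095, because at most one writer is open at any time in the latter.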

CASSANDRA-6503 also introduced the {{StreamLockFile}} to remove the files of already-closed {{SSTableWriter}}
instances if the node goes down before these files are processed at the end of the stream receive task.
So, the proposed solution basically returns to the previous behavior of closing each {{SSTableWriter}}
as soon as its sstable is received, while adding already-closed-but-not-yet-live files to the {{StreamLockFile}}.
As soon as the sstables are added to the data tracker, the {{StreamLockFile}} is removed.
If the stream session fails before that, the already-closed-but-not-yet-live sstables are
cleaned up. If there is a failure while adding files to the data tracker, only the files that
were not yet added are removed, since the ones already added are live. If the node
goes down during a stream session, the already-closed-but-not-yet-live sstables listed in
the {{StreamLockFile}} are removed on the next startup, as is done today.
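
The lifecycle described above can be sketched as follows. This is not Cassandra's {{StreamLockFile}}: the class {{ToyStreamLockFile}}, its one-path-per-line file format, and the meaning given to {{append}}, {{cleanup}} and {{delete}} are assumptions made only to illustrate the idea.

```python
import os
import tempfile


class ToyStreamLockFile:
    """Hypothetical stand-in for the lock file idea, not the real class."""

    def __init__(self, path):
        self.path = path

    def append(self, sstable_path):
        # Record a closed-but-not-yet-live sstable, one path per line.
        with open(self.path, "a") as f:
            f.write(sstable_path + "\n")

    def listed(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return [line.strip() for line in f if line.strip()]

    def cleanup(self, skip=()):
        # Remove every listed sstable except those already live,
        # then remove the lock file itself.
        for p in self.listed():
            if p not in skip and os.path.exists(p):
                os.remove(p)
        self.delete()

    def delete(self):
        if os.path.exists(self.path):
            os.remove(self.path)


def demo():
    with tempfile.TemporaryDirectory() as d:
        lock = ToyStreamLockFile(os.path.join(d, "stream.lock"))
        received = []
        for i in range(3):
            p = os.path.join(d, f"sstable-{i}.db")
            with open(p, "w") as f:
                f.write("data")
            lock.append(p)  # closed, but not yet live
            received.append(p)
        # Simulate a failure after only the first sstable became live.
        lock.cleanup(skip={received[0]})
        return [os.path.exists(p) for p in received], os.path.exists(lock.path)
```

In {{demo}}, the sstable that was already live survives the cleanup, the other two are deleted, and the lock file is gone afterwards. On the success path, only {{delete}} would be called once all sstables are live.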

Since {{StreamLockFile}} is a much more critical component with this approach, I added unit
tests to verify that {{append}}, {{cleanup}}, {{skip}} and {{delete}} work correctly.
We also need to ignore sstables that are listed in a {{StreamLockFile}} during
{{nodetool refresh}}. I will do that after the first review if this approach is validated.

Below are some test results with and without the patch, with constrained (50M) and unconstrained
(500M) memory.



||*||unpatched||patched||
||constrained|!10797-nonpatched.png!|!10797-patched.png!|
||unconstrained|!10798-nonpatched-500M.png!|!10798-patched-500M.png!|

In the constrained case, the unpatched version hit an OOM soon after starting bootstrap, while the
patched version finished bootstrap successfully. In the unconstrained case, the memory footprint
is between 1/3 and 1/2 smaller, but the difference is probably much larger with large
sstables.

Below are the initial patch and tests:

||2.1||
|[branch|https://github.com/apache/cassandra/compare/cassandra-2.1...pauloricardomg:2.1-10797]|
|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-10797-testall/lastCompletedBuild/testReport/]|
|[dtest|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-10797-dtest/lastCompletedBuild/testReport/]|

I will provide 2.2+ versions after review.



> Bootstrap new node fails with OOM when streaming nodes contains thousands of sstables
> -------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-10797
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10797
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Streaming and Messaging
>         Environment: Cassandra 2.1.8.621 w/G1GC
>            Reporter: Jose Martinez Poblete
>            Assignee: Paulo Motta
>             Fix For: 2.1.x
>
>         Attachments: 10797-nonpatched.png, 10797-patched.png, 10798-nonpatched-500M.png, 10798-patched-500M.png, 112415_system.log, Heapdump_OOM.zip, Screen Shot 2015-12-01 at 7.34.40 PM.png, dtest.tar.gz
>
>
> When adding a new node to an existing DC, it runs out of memory (OOM) after 25-45 minutes.
> Upon reviewing the heap dump, we found that the sending nodes are streaming thousands of sstables, which in turn blows up the heap of the bootstrapping node.
> {noformat}
> ERROR [RMI Scheduler(0)] 2015-11-24 10:10:44,585 JVMStabilityInspector.java:94 - JVM state determined to be unstable.  Exiting forcefully due to:
> java.lang.OutOfMemoryError: Java heap space
> ERROR [STREAM-IN-/173.36.28.148] 2015-11-24 10:10:44,585 StreamSession.java:502 - [Stream #0bb13f50-92cb-11e5-bc8d-f53b7528ffb4] Streaming error occurred
> java.lang.IllegalStateException: Shutdown in progress
>         at java.lang.ApplicationShutdownHooks.remove(ApplicationShutdownHooks.java:82) ~[na:1.8.0_65]
>         at java.lang.Runtime.removeShutdownHook(Runtime.java:239) ~[na:1.8.0_65]
>         at org.apache.cassandra.service.StorageService.removeShutdownHook(StorageService.java:747) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at org.apache.cassandra.utils.JVMStabilityInspector$Killer.killCurrentJVM(JVMStabilityInspector.java:95) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at org.apache.cassandra.utils.JVMStabilityInspector.inspectThrowable(JVMStabilityInspector.java:64) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:66) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:38) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:55) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:250) ~[cassandra-all-2.1.8.621.jar:2.1.8.621]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_65]
> ERROR [RMI TCP Connection(idle)] 2015-11-24 10:10:44,585 JVMStabilityInspector.java:94 - JVM state determined to be unstable.  Exiting forcefully due to:
> java.lang.OutOfMemoryError: Java heap space
> ERROR [OptionalTasks:1] 2015-11-24 10:10:44,585 CassandraDaemon.java:223 - Exception in thread Thread[OptionalTasks:1,5,main]
> java.lang.IllegalStateException: Shutdown in progress
> {noformat}
> Attached is the Eclipse MAT report as a zipped web page



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
