cassandra-commits mailing list archives

From "Jacek Furmankiewicz (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-7361) Cassandra locks up in full GC when you assign the entire heap to row cache
Date Fri, 13 Jun 2014 15:07:02 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-7361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030703#comment-14030703
] 

Jacek Furmankiewicz edited comment on CASSANDRA-7361 at 6/13/14 3:06 PM:
-------------------------------------------------------------------------

For what it's worth, here are some results of our testing. I hope this will be helpful.

We set Cassandra to a 24 GB heap, with the row cache at 16 GB, locked via numactl cpubind to a single
NUMA node with 16 cores (on a 64-core server).
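
Roughly, those sizes translate to something like the following (illustrative only; exact file
locations vary by install, and the HEAP_NEWSIZE value shown here is just a placeholder):

{quote}
# heap sizes picked up by the startup scripts (illustrative)
MAX_HEAP_SIZE="24G"
HEAP_NEWSIZE="6G"      # placeholder value, not the exact setting we used

# and in conf/cassandra.yaml, 16 GB of row cache:
# row_cache_size_in_mb: 16384
{quote}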

We changed all the JVM_OPTS in bin/cassandra to simply use G1 instead of all the other GC
settings:

--------------------------------------------------
# enable assertions.  disabling this in production will give a modest
# performance benefit (around 5%).
JVM_OPTS="$JVM_OPTS -ea"

# add the jamm javaagent
JVM_OPTS="$JVM_OPTS -javaagent:$CASSANDRA_HOME/lib/jamm-0.2.5.jar"

# some JVMs will fill up their heap when accessed via JMX, see CASSANDRA-6541
JVM_OPTS="$JVM_OPTS -XX:+CMSClassUnloadingEnabled"

# enable thread priorities, primarily so we can give periodic tasks
# a lower priority to avoid interfering with client workload
JVM_OPTS="$JVM_OPTS -XX:+UseThreadPriorities"
# allows lowering thread priority without being root.  see
# http://tech.stolsvik.com/2010/01/linux-java-thread-priorities-workaround.html
JVM_OPTS="$JVM_OPTS -XX:ThreadPriorityPolicy=42"

# min and max heap sizes should be set to the same value to avoid
# stop-the-world GC pauses during resize, and so that we can lock the
# heap in memory on startup to prevent any of it from being swapped
# out.
JVM_OPTS="$JVM_OPTS -Xms${MAX_HEAP_SIZE}"
JVM_OPTS="$JVM_OPTS -Xmx${MAX_HEAP_SIZE}"
JVM_OPTS="$JVM_OPTS -Xmn${HEAP_NEWSIZE}"
JVM_OPTS="$JVM_OPTS -XX:+HeapDumpOnOutOfMemoryError"

# set jvm HeapDumpPath with CASSANDRA_HEAPDUMP_DIR
if [ "x$CASSANDRA_HEAPDUMP_DIR" != "x" ]; then
    JVM_OPTS="$JVM_OPTS -XX:HeapDumpPath=$CASSANDRA_HEAPDUMP_DIR/cassandra-`date +%s`-pid$$.hprof"
fi


startswith() { [ "${1#$2}" != "$1" ]; }

# Per-thread stack size.
JVM_OPTS="$JVM_OPTS -Xss256k"

# Larger interned string table, for gossip's benefit (CASSANDRA-6410)
JVM_OPTS="$JVM_OPTS -XX:StringTableSize=1000003"

# GC tuning options
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=10"

#JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC" 
#JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled" 
#JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8" 
#JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
#JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
#JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
#JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
--------------------------------------------------
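
As an aside, the 10 ms pause target is aggressive (G1's default is 200 ms), so it is worth
confirming what G1 actually delivers. A minimal sketch of GC logging flags for HotSpot/JDK 7
that could be appended to the same file (not part of the config above):

{quote}
# illustrative only: log GC activity so G1 pause times can be verified
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
JVM_OPTS="$JVM_OPTS -Xloggc:${CASSANDRA_HOME}/logs/gc.log"
{quote}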

NUMA settings in bin/cassandra

{quote}
NUMACTL_ARGS="--cpubind=0 --localalloc"
if which numactl >/dev/null 2>/dev/null && numactl $NUMACTL_ARGS ls / >/dev/null 2>/dev/null
then
    NUMACTL="numactl $NUMACTL_ARGS"
else
    NUMACTL=""
fi
{quote}
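
To double-check that the binding actually took effect, something along these lines can be run
against the Cassandra PID (illustrative; numastat -p needs a reasonably recent numactl package):

{quote}
# which NUMA node the process memory is allocated on
numastat -p <cassandra_pid>
# which CPUs the process is allowed to run on
taskset -cp <cassandra_pid>
{quote}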



Then we threw our biggest batch processing job at it. It ran the whole night, creating probably
around 200 million columns during the processing run.
The writes are so large we had to increase the thrift frame size to 60 MB to accommodate some
of the largest batches.
This process was also locked to a different NUMA node with 16 cores.
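
For reference, the server-side knob for this in cassandra.yaml on 2.0.x is
thrift_framed_transport_size_in_mb (the client frame size has to be raised to match); roughly:

{quote}
# conf/cassandra.yaml (illustrative)
thrift_framed_transport_size_in_mb: 60
{quote}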

On top of that we threw in a reader process that peels off events related to the creation of those
200 million columns and synchronizes an external data destination.
We do this via a message queue in Cassandra, using all sorts of nifty tricks in there to
avoid tombstone exceptions during queries.
So this process did massively heavy reads while the other one was doing massive writes. It was
also NUMA-locked to a different node with 16 cores.

After a whole night of processing, Cassandra was doing just fine. Not a single instance of
STW GC occurred:

{quote}
-bash-4.1$ jstat -gc 72811 5s

 S0C     S1C      S0U     S1U      EC        EU        OC         OU        PC       PU      YGC    YGCT      FGC   FGCT    GCT
 0.0   180224.0   0.0   180224.0  6430720.0 5292032.0 18554880.0 6491303.4  32768.0  27272.2  20270  6859.859   0    0.000  6859.859
 0.0   163840.0   0.0   163840.0  6447104.0 2957312.0 18554880.0 6524999.7  32768.0  27272.2  20274  6860.643   0    0.000  6860.643
 0.0   172032.0   0.0   172032.0  6438912.0 5939200.0 18554880.0 6543167.4  32768.0  27272.2  20277  6861.222   0    0.000  6861.222
 0.0   172032.0   0.0   172032.0  6438912.0 6127616.0 18554880.0 6566848.4  32768.0  27272.2  20281  6861.806   0    0.000  6861.806
{quote}
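
To read the output above: FGC and FGCT are the full GC count and accumulated full GC time, both
stuck at 0 across the run, while the young collections (YGC) keep ticking along. The same data as
utilization percentages can be pulled with, e.g.:

{quote}
jstat -gcutil 72811 5s
{quote}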

So if you ask me, the whole off-heap row cache really does not buy much, considering the
overhead it still has to maintain within the heap.

I think Cassandra should experiment with keeping the entire cache on heap, without any Unsafe
shenanigans, and letting G1 manage it.

Hope this helps.





> Cassandra locks up in full GC when you assign the entire heap to row cache
> --------------------------------------------------------------------------
>
>                 Key: CASSANDRA-7361
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7361
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Ubuntu, RedHat, JDK 1.7
>            Reporter: Jacek Furmankiewicz
>            Priority: Minor
>         Attachments: histogram.png, leaks_report.png, top_consumers.png
>
>
> We have a long running batch load process, which runs for many hours.
> Massive amount of writes, in large mutation batches (we increase the thrift frame size to 45 MB).
> Everything goes well, but after about 3 hrs of processing everything locks up. We start getting NoHostsAvailable exceptions on the Java application side (with Astyanax as our driver), and eventually socket timeouts.
> Looking at Cassandra, we can see that it is using nearly the full 8 GB of heap and is unable to free it. It spends most of its time in full GC, but the amount of used memory does not go down.
> Here is a long sample from jstat to show this over an extended time period
> e.g.
> http://aep.appspot.com/display/NqqEagzGRLO_pCP2q8hZtitnuVU/
> This continues even after we shut down our app. Nothing is connected to Cassandra any more, yet it is still stuck in full GC and cannot free up memory.
> Running nodetool tpstats shows that nothing is pending, all seems OK:
> {quote}
> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
> ReadStage                         0         0       69555935         0                 0
> RequestResponseStage              0         0              0         0                 0
> MutationStage                     0         0       73123690         0                 0
> ReadRepairStage                   0         0              0         0                 0
> ReplicateOnWriteStage             0         0              0         0                 0
> GossipStage                       0         0              0         0                 0
> CacheCleanupExecutor              0         0              0         0                 0
> MigrationStage                    0         0             46         0                 0
> MemoryMeter                       0         0           1125         0                 0
> FlushWriter                       0         0            824         0                30
> ValidationExecutor                0         0              0         0                 0
> InternalResponseStage             0         0             23         0                 0
> AntiEntropyStage                  0         0              0         0                 0
> MemtablePostFlusher               0         0           1783         0                 0
> MiscStage                         0         0              0         0                 0
> PendingRangeCalculator            0         0              1         0                 0
> CompactionExecutor                0         0          74330         0                 0
> commitlog_archiver                0         0              0         0                 0
> HintedHandoff                     0         0              0         0                 0
> Message type           Dropped
> RANGE_SLICE                  0
> READ_REPAIR                  0
> PAGED_RANGE                  0
> BINARY                       0
> READ                       585
> MUTATION                 75775
> _TRACE                       0
> REQUEST_RESPONSE             0
> COUNTER_MUTATION             0
> {quote}
> We had this happen on 2 separate boxes, one with 2.0.6, the other with 2.0.8.
> Right now this is a total blocker for us. We are unable to process the customer's data and have to abort in the middle of a large processing run.
> This is a new customer, so we did not have a chance to see if this occurred with 1.1 or 1.2 in the past (we moved to 2.0 recently).
> We have the Cassandra process still running; please let us know if there is anything else we could run to give you more insight.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
