cassandra-commits mailing list archives

From "sunjian (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-5205) The first three Cassandra nodes are very busy, GC pauses the world (real production env. exp.)
Date Thu, 31 Jan 2013 04:13:14 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

sunjian updated CASSANDRA-5205:
-------------------------------

    Description: 
Hi,

I previously had 10 nodes, each a CentOS VM with 16 GB RAM and an 8-core CPU, running Cassandra 1.1.5 with a single user keyspace (RF=3). Heap: 8 GB old generation, 2 GB new generation.

Symptoms:
1. The first three nodes (starting from token 0) are busy all the time, while the remaining 7 nodes seem to have nothing to do; their CPU and RAM are mostly idle.

2. JVM memory use on the first three nodes grows rapidly, and CMS GC fires nearly every second.

3. When GC happens, the node appears to stop the world. Checking with nodetool: run against one of the first three nodes, nodetool hangs; run against the remaining 7 nodes, it reports the first three nodes as down.

4. When GC finishes, the node comes back, but it goes down again a few minutes later.

5. If I kill the Java process and restart the frozen node, it comes back up within minutes, the JVM heap fills up again within minutes, and everything above repeats.

6. Even if only one of the first three nodes is frozen, client requests fail, even though my requests use CL=QUORUM and I am using the Hector client library (see the consistency-level sketch after this list).

7. Disabling the Thrift API on the three nodes changes nothing.
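
On point 6: with RF=3 and CL=QUORUM, a request needs 2 of the 3 replicas for the row, so a single frozen replica by itself should not fail requests unless a second replica is pausing at the same time. Below is a minimal sketch of how QUORUM is typically configured through Hector's ConfigurableConsistencyLevel; the cluster name, contact point, and keyspace name are placeholders I have assumed, not values taken from this report.

import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.HConsistencyLevel;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;

public class QuorumClientSketch {
    public static void main(String[] args) {
        // Placeholder cluster name and contact point (host:thrift_port).
        Cluster cluster = HFactory.getOrCreateCluster("TestCluster",
                new CassandraHostConfigurator("10.0.0.22:9160"));

        // Require QUORUM (2 of 3 replicas at RF=3) for both reads and writes.
        ConfigurableConsistencyLevel ccl = new ConfigurableConsistencyLevel();
        ccl.setDefaultReadConsistencyLevel(HConsistencyLevel.QUORUM);
        ccl.setDefaultWriteConsistencyLevel(HConsistencyLevel.QUORUM);

        // Placeholder keyspace name; the report only says "one user keyspace".
        Keyspace keyspace = HFactory.createKeyspace("UserKeyspace", cluster, ccl);
        System.out.println("Using keyspace: " + keyspace.getKeyspaceName());
    }
}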

############change#############
0. Stop incoming user requests (we stopped our user-facing service so Cassandra is idle).
1. Decommission 4 nodes (one by one).
2. Move tokens to balance the remaining 6 nodes (one by one); see the token sketch after this list.
3. Change the remaining 6 nodes' resources to 30 GB RAM and a 16-core CPU, with heap (old: 16 GB, new: 4 GB).
4. Enable JNA.
5. Run a major compaction on the 6 nodes, then run repair on the 6 nodes.
6. Start the new cluster.
7. Everything seems fine early on, but after 5 hours all the bad symptoms come back.
8. Because we now have double the RAM, the repeating death cycle now runs roughly hourly.
9. JVM opts: -ea -javaagent:./../lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42
-Xms16G -Xmx16G -Xmn4G -Xss180k -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly
-Djava.net.preferIPv4Stack=true -Djava.rmi.server.hostname=10.0.0.22 -Dcom.sun.management.jmxremote.port=7199
-Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false
-Dlog4j.configuration=log4j-server.properties -Dlog4j.defaultInitOverride=true
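
On step 2: with the RandomPartitioner, balancing a ring of N nodes usually means assigning node i the token i * 2^127 / N and applying it with "nodetool move". Here is a small sketch of that arithmetic for the 6 remaining nodes; it is illustrative only, since the actual tokens used are not given in this report.

import java.math.BigInteger;

public class BalancedTokens {
    public static void main(String[] args) {
        int nodeCount = 6; // the six nodes left after decommissioning four
        // RandomPartitioner tokens range over 0 .. 2^127.
        BigInteger ringSize = BigInteger.valueOf(2).pow(127);
        for (int i = 0; i < nodeCount; i++) {
            BigInteger token = ringSize.multiply(BigInteger.valueOf(i))
                                       .divide(BigInteger.valueOf(nodeCount));
            // Each value would then be applied with: nodetool -h <node> move <token>
            System.out.println("node " + i + " initial_token: " + token);
        }
    }
}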

Some screenshots are attached.


> The first three Cassandra nodes are very busy, GC pauses the world (real production env. exp.)
> --------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-5205
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-5205
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 1.1.5
>         Environment: Cassandra 1.1.5 release
> CentOS 5.5
> JDK 1.7u9
> VMware ESXi-based VM: 30 GB RAM, 4*4-core CPU
> Hardware: Dell R720, 2*6-core CPU, 128 GB RAM, hosting 3 nodes as above
> Data hosted by each node: about 8 GB
>            Reporter: sunjian
>            Priority: Minor
>             Fix For: 1.1.10
>
>         Attachments: the-normal-free-node-no-presure.jpg, the-trouble-maker-node.jpg
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
