incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Timo Nentwig <timo.nent...@toptarif.de>
Subject Re: Memory leak with Sun Java 1.6 ?
Date Tue, 14 Dec 2010 15:15:02 GMT
On Dec 14, 2010, at 15:31, Timo Nentwig wrote:

> On Dec 14, 2010, at 14:41, Jonathan Ellis wrote:
> 
>> This is "A row has grown too large" section from that troubleshooting guide.
> 
> Why? This is what a typical "row" (?) looks like:
> 
> [default@test] list tracking limit 1;
> -------------------
> RowKey:  123
> => (column=key, value=foo, timestamp=1292238005886000)
> => (column=value, value=bar, timestamp=1292238005886000)
> 
> I'm importing the data from a RDBMS so I can say for sure that none of the 2 columns
will contain more than 255 chars.

Started all over again. Deleleted /var/lib/cassandra/*, started 4 nodes (2x2G+2x1G heap, actual
RAM twice as much, swap off) and my stress test: first node (1GB) crashed after 212s (!).
Wrote roughly about 200k rows like the one above (data wasn't even flushed to disk!).

 INFO [main] 2010-12-14 15:58:23,938 CassandraDaemon.java (line 119) Listening for thrift
clients...
 INFO [GossipStage:1] 2010-12-14 15:58:24,410 Gossiper.java (line 569) InetAddress /192.168.68.80
is now UP
 INFO [HintedHandoff:1] 2010-12-14 15:58:24,410 HintedHandOffManager.java (line 191) Started
hinted handoff for endpoint /192.168.68.80
 INFO [HintedHandoff:1] 2010-12-14 15:58:24,412 HintedHandOffManager.java (line 247) Finished
hinted handoff of 0 rows to endpoint /192.168.68.80
 INFO [GossipStage:1] 2010-12-14 15:58:26,335 Gossiper.java (line 577) Node /192.168.68.69
is now part of the cluster
 INFO [GossipStage:1] 2010-12-14 15:58:27,319 Gossiper.java (line 569) InetAddress /192.168.68.69
is now UP
 INFO [HintedHandoff:1] 2010-12-14 15:58:27,319 HintedHandOffManager.java (line 191) Started
hinted handoff for endpoint /192.168.68.69
 INFO [HintedHandoff:1] 2010-12-14 15:58:27,320 HintedHandOffManager.java (line 247) Finished
hinted handoff of 0 rows to endpoint /192.168.68.69
 INFO [GossipStage:1] 2010-12-14 15:58:29,446 Gossiper.java (line 577) Node /192.168.68.70
is now part of the cluster
 INFO [GossipStage:1] 2010-12-14 15:58:29,620 Gossiper.java (line 569) InetAddress /192.168.68.70
is now UP
 INFO [HintedHandoff:1] 2010-12-14 15:58:29,621 HintedHandOffManager.java (line 191) Started
hinted handoff for endpoint /192.168.68.70
 INFO [HintedHandoff:1] 2010-12-14 15:58:29,621 HintedHandOffManager.java (line 247) Finished
hinted handoff of 0 rows to endpoint /192.168.68.70
 INFO [MigrationStage:1] 2010-12-14 16:01:16,535 ColumnFamilyStore.java (line 639) switching
in a fresh Memtable for Migrations at CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1292338703483.log',
position=15059)
 INFO [MigrationStage:1] 2010-12-14 16:01:16,536 ColumnFamilyStore.java (line 943) Enqueuing
flush of Memtable-Migrations@768974140(9092 bytes, 1 operations)
 INFO [MigrationStage:1] 2010-12-14 16:01:16,536 ColumnFamilyStore.java (line 639) switching
in a fresh Memtable for Schema at CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1292338703483.log',
position=15059)
 INFO [MigrationStage:1] 2010-12-14 16:01:16,536 ColumnFamilyStore.java (line 943) Enqueuing
flush of Memtable-Schema@591783334(3765 bytes, 3 operations)
 INFO [FlushWriter:1] 2010-12-14 16:01:16,541 Memtable.java (line 155) Writing Memtable-Migrations@768974140(9092
bytes, 1 operations)
 INFO [FlushWriter:1] 2010-12-14 16:01:16,633 Memtable.java (line 162) Completed flushing
/var/lib/cassandra/data/system/Migrations-e-1-Data.db (9225 bytes)
 INFO [FlushWriter:1] 2010-12-14 16:01:16,634 Memtable.java (line 155) Writing Memtable-Schema@591783334(3765
bytes, 3 operations)
 INFO [FlushWriter:1] 2010-12-14 16:01:16,706 Memtable.java (line 162) Completed flushing
/var/lib/cassandra/data/system/Schema-e-1-Data.db (4053 bytes)
 INFO [Create index Indexed1.626972746864617465] 2010-12-14 16:01:16,718 ColumnFamilyStore.java
(line 325) Creating index org.apache.cassandra.db.Table@354d581b.Indexed1.626972746864617465
 INFO [MigrationStage:1] 2010-12-14 16:01:16,725 ColumnFamilyStore.java (line 639) switching
in a fresh Memtable for Migrations at CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1292338703483.log',
position=27559)
 INFO [MigrationStage:1] 2010-12-14 16:01:16,726 ColumnFamilyStore.java (line 943) Enqueuing
flush of Memtable-Migrations@1922556837(7973 bytes, 1 operations)
 INFO [Create index Indexed1.626972746864617465] 2010-12-14 16:01:16,729 ColumnFamilyStore.java
(line 339) Index Indexed1.626972746864617465 complete
 INFO [MigrationStage:1] 2010-12-14 16:01:16,730 ColumnFamilyStore.java (line 639) switching
in a fresh Memtable for Schema at CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1292338703483.log',
position=27559)
 INFO [FlushWriter:1] 2010-12-14 16:01:16,730 Memtable.java (line 155) Writing Memtable-Migrations@1922556837(7973
bytes, 1 operations)
 INFO [MigrationStage:1] 2010-12-14 16:01:16,730 ColumnFamilyStore.java (line 943) Enqueuing
flush of Memtable-Schema@1373806697(4029 bytes, 4 operations)
 INFO [Create index Indexed1.626972746864617465] 2010-12-14 16:01:16,733 ColumnFamilyStore.java
(line 639) switching in a fresh Memtable for IndexInfo at CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1292338703483.log',
position=27707)
 INFO [Create index Indexed1.626972746864617465] 2010-12-14 16:01:16,734 ColumnFamilyStore.java
(line 943) Enqueuing flush of Memtable-IndexInfo@1972467582(42 bytes, 1 operations)
 INFO [FlushWriter:1] 2010-12-14 16:01:16,808 Memtable.java (line 162) Completed flushing
/var/lib/cassandra/data/system/Migrations-e-2-Data.db (8106 bytes)
 INFO [FlushWriter:1] 2010-12-14 16:01:16,816 Memtable.java (line 155) Writing Memtable-Schema@1373806697(4029
bytes, 4 operations)
 INFO [FlushWriter:1] 2010-12-14 16:01:16,902 Memtable.java (line 162) Completed flushing
/var/lib/cassandra/data/system/Schema-e-2-Data.db (4317 bytes)
 INFO [FlushWriter:1] 2010-12-14 16:01:16,904 Memtable.java (line 155) Writing Memtable-IndexInfo@1972467582(42
bytes, 1 operations)
 INFO [FlushWriter:1] 2010-12-14 16:01:16,968 Memtable.java (line 162) Completed flushing
/var/lib/cassandra/data/system/IndexInfo-e-1-Data.db (170 bytes)
ERROR [MutationStage:22] 2010-12-14 16:02:00,283 AbstractCassandraDaemon.java (line 90) Fatal
exception in thread Thread[MutationStage:22,5,main]
java.lang.OutOfMemoryError: Java heap space
        at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:39)
        at java.nio.ByteBuffer.allocate(ByteBuffer.java:312)
        at org.apache.cassandra.utils.FBUtilities.readByteArray(FBUtilities.java:261)
        at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:76)
        at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:35)
        at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:129)
        at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:120)
        at org.apache.cassandra.db.RowMutationSerializer.defreezeTheMaps(RowMutation.java:383)
        at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:393)
        at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:351)
        at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:52)
        at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

gc:

210.919: [GC 210.919: [ParNew
Desired survivor size 2162688 bytes, new threshold 1 (max 1)
- age   1:    1833480 bytes,    1833480 total
: 16139K->4221K(38336K), 0.0377710 secs]210.957: [CMS: 690490K->171727K(1006016K), 1.0604710
secs] 702251K->171727K(1044352K), [CMS Perm : 17750K->17734K(21248K)], 1.0985490 secs]
[Times: user=1.11 sys=0.01, real=1.10 secs] 
212.017: [Full GC 212.017: [CMS: 171727K->171315K(1006016K), 1.0312260 secs] 171727K->171315K(1044352K),
[CMS Perm : 17734K->17504K(29560K)], 1.0314660 secs] [Times: user=1.03 sys=0.00, real=1.03
secs] 
Total time for which application threads were stopped: 2.1309030 seconds
Total time for which application threads were stopped: 4.4463770 seconds
Heap
 par new generation   total 38336K, used 15546K [0x00000000bae00000, 0x00000000bd790000, 0x00000000bd790000)
  eden space 34112K,  45% used [0x00000000bae00000, 0x00000000bbd2ebb8, 0x00000000bcf50000)
  from space 4224K,   0% used [0x00000000bd370000, 0x00000000bd370000, 0x00000000bd790000)
  to   space 4224K,   0% used [0x00000000bcf50000, 0x00000000bcf50000, 0x00000000bd370000)
 concurrent mark-sweep generation total 1006016K, used 171315K [0x00000000bd790000, 0x00000000fae00000,
0x00000000fae00000)
 concurrent-mark-sweep perm gen total 29560K, used 17572K [0x00000000fae00000, 0x00000000fcade000,
0x0000000100000000)


During this short time I experienced 4 times (mentioned this already on the dev list):

me.prettyprint.hector.api.exceptions.HCassandraInternalException: Cassandra encountered an
internal error processing this request: TApplicationError type: 6 message:Internal error processing
batch_mutate
	at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:27)
	at me.prettyprint.cassandra.service.KeyspaceServiceImpl$1.execute(KeyspaceServiceImpl.java:95)
	at me.prettyprint.cassandra.service.KeyspaceServiceImpl$1.execute(KeyspaceServiceImpl.java:1)
	at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:89)
	at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:142)
	at me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:129)
	at me.prettyprint.cassandra.service.KeyspaceServiceImpl.batchMutate(KeyspaceServiceImpl.java:100)
	at me.prettyprint.cassandra.service.KeyspaceServiceImpl.batchMutate(KeyspaceServiceImpl.java:106)
	at me.prettyprint.cassandra.model.MutatorImpl$2.doInKeyspace(MutatorImpl.java:149)
	at me.prettyprint.cassandra.model.MutatorImpl$2.doInKeyspace(MutatorImpl.java:1)
	at me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20)
	at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:65)
	at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:146)
	at Trackr$3.run(Trackr.java:133)
	at Trackr$BoundedExecutor$1.run(Trackr.java:167)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:680)

> That row warning seems to be a 0.6 option, don't find it in 0.7 docs. Didn't find any
related WARNs for some default value in the log also.
> 
>> On Tue, Dec 14, 2010 at 5:27 AM, Timo Nentwig <timo.nentwig@toptarif.de> wrote:
>> 
>> On Dec 12, 2010, at 17:21, Jonathan Ellis wrote:
>> 
>>> http://www.riptano.com/docs/0.6/troubleshooting/index#nodes-are-dying-with-oom-errors
>> 
>> I can rule out the first 3. I was running cassandra with default settings, i.e. 1GB
heap and 256M memtable. So, with 3 memtables+1GB the JVM should run with >1.75G (although
http://wiki.apache.org/cassandra/MemtableThresholds considers to increase heap size only gently).
Did so. 4GB machine with 2GB 64bit-JVM seemed to run stable for quite some time but then also
crashed with OOM. Looking at the heap dump it's always the same: all memory nearly always
bound in CompactionExecutor (ColumnFamilyStore/ConcurrentSkipListMap, respectively).
>> 
>> This looks like somebody else recently have had a similar problem (->Bottom line:
more heap - which is okay, but I'd like to understand why):
>> http://www.mail-archive.com/user@cassandra.apache.org/msg07516.html
>> 
>> This is my only CF currently in use (via JMX):
>> 
>> - column_families:
>> - column_type: Standard
>>   comment: tracking column family
>>   compare_with: org.apache.cassandra.db.marshal.UTF8Type
>>   default_validation_class: org.apache.cassandra.db.marshal.UTF8Type
>>   gc_grace_seconds: 864000
>>   key_cache_save_period_in_seconds: 3600
>>   keys_cached: 200000.0
>>   max_compaction_threshold: 32
>>   memtable_flush_after_mins: 60
>>   min_compaction_threshold: 4
>>   name: tracking
>>   read_repair_chance: 1.0
>>   row_cache_save_period_in_seconds: 0
>>   rows_cached: 0.0
>> name: test
>> replica_placement_strategy: org.apache.cassandra.locator.SimpleStrategy
>> replication_factor: 3
>> 
>> 
>> In addition...actually there is plenty of free memory on the heap (?):
>> 
>> 3605.478: [GC 3605.478: [ParNew
>> Desired survivor size 2162688 bytes, new threshold 1 (max 1)
>> - age   1:     416112 bytes,     416112 total
>> : 16887K->553K(38336K), 0.0209550 secs]3605.499: [CMS: 1145267K->447565K(2054592K),
1.9143630 secs] 1161938K->447565K(2092928K), [CMS Perm : 18186K->18158K(30472K)], 1.9355340
secs] [Times: user=1.95 sys=0.00, real=1.94 secs]
>> 3607.414: [Full GC 3607.414: [CMS: 447565K->447453K(2054592K), 1.9694960 secs]
447565K->447453K(2092928K), [CMS Perm : 18158K->18025K(30472K)], 1.9696450 secs] [Times:
user=1.92 sys=0.00, real=1.97 secs]
>> Total time for which application threads were stopped: 3.9070380 seconds
>> Total time for which application threads were stopped: 7.3388640 seconds
>> Total time for which application threads were stopped: 0.0560610 seconds
>> 3616.931: [GC 3616.931: [ParNew
>> Desired survivor size 2162688 bytes, new threshold 1 (max 1)
>> - age   1:     474264 bytes,     474264 total
>> : 34112K->747K(38336K), 0.0098680 secs] 481565K->448201K(2092928K), 0.0099690
secs] [Times: user=0.00 sys=0.00, real=0.01 secs]
>> Total time for which application threads were stopped: 0.0108670 seconds
>> 3617.035: [GC 3617.035: [ParNew
>> Desired survivor size 2162688 bytes, new threshold 1 (max 1)
>> - age   1:      63040 bytes,      63040 total
>> : 34859K->440K(38336K), 0.0065950 secs] 482313K->448455K(2092928K), 0.0066880
secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
>> Total time for which application threads were stopped: 0.0075850 seconds
>> 3617.133: [GC 3617.133: [ParNew
>> Desired survivor size 2162688 bytes, new threshold 1 (max 1)
>> - age   1:      23016 bytes,      23016 total
>> : 34552K->121K(38336K), 0.0042920 secs] 482567K->448193K(2092928K), 0.0043650
secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
>> Total time for which application threads were stopped: 0.0049630 seconds
>> 3617.228: [GC 3617.228: [ParNew
>> Desired survivor size 2162688 bytes, new threshold 1 (max 1)
>> - age   1:      16992 bytes,      16992 total
>> : 34233K->34K(38336K), 0.0043180 secs] 482305K->448122K(2092928K), 0.0043910
secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
>> Total time for which application threads were stopped: 0.0049150 seconds
>> 3617.323: [GC 3617.323: [ParNew
>> Desired survivor size 2162688 bytes, new threshold 1 (max 1)
>> - age   1:      18456 bytes,      18456 total
>> : 34146K->29K(38336K), 0.0038930 secs] 482234K->448127K(2092928K), 0.0039810
secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
>> Total time for which application threads were stopped: 0.0055390 seconds
>> Heap
>> par new generation   total 38336K, used 17865K [0x000000077ae00000, 0x000000077d790000,
0x000000077d790000)
>> eden space 34112K,  52% used [0x000000077ae00000, 0x000000077bf6afb0, 0x000000077cf50000)
>> from space 4224K,   0% used [0x000000077cf50000, 0x000000077cf57720, 0x000000077d370000)
>> to   space 4224K,   0% used [0x000000077d370000, 0x000000077d370000, 0x000000077d790000)
>> concurrent mark-sweep generation total 2054592K, used 448097K [0x000000077d790000,
0x00000007fae00000, 0x00000007fae00000)
>> concurrent-mark-sweep perm gen total 30472K, used 18125K [0x00000007fae00000, 0x00000007fcbc2000,
0x0000000800000000)
>> 
>>> 
>>> On Sun, Dec 12, 2010 at 9:52 AM, Timo Nentwig <timo.nentwig@toptarif.de>
wrote:
>>> 
>>> On Dec 10, 2010, at 19:37, Peter Schuller wrote:
>>> 
>>>> To cargo cult it: Are you running a modern JVM? (Not e.g. openjdk b17
>>>> in lenny or some such.) If it is a JVM issue, ensuring you're using a
>>>> reasonably recent JVM is probably much easier than to start tracking
>>>> it down...
>>> 
>>> I had OOM problems with OpenJDK, switched to Sun/Oracle's recent 1.6.0_23 and...still
have the same problem :-\ Stack trace always looks the same:
>>> 
>>> java.lang.OutOfMemoryError: Java heap space
>>>       at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
>>>       at java.nio.ByteBuffer.allocate(ByteBuffer.java:329)
>>>       at org.apache.cassandra.utils.FBUtilities.readByteArray(FBUtilities.java:261)
>>>       at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:76)
>>>       at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:35)
>>>       at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:129)
>>>       at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:120)
>>>       at org.apache.cassandra.db.RowMutationSerializer.defreezeTheMaps(RowMutation.java:383)
>>>       at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:393)
>>>       at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:351)
>>>       at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:52)
>>>       at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63)
>>>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>>       at java.lang.Thread.run(Thread.java:636)
>>> 
>>> I'm writing from 1 client with 50 threads to a cluster of 4 machines (with hector).
With QUORUM and ONE 2 machines quite reliably will soon die with OOM. What may cause this?
Won't cassandra block/reject when memtable is full and being flushed to disk but grow and
if flushing to disk isn't fast enough will run out of memory?
>>> 
>>> 
>>> 
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of Riptano, the source for professional Cassandra support
>>> http://riptano.com
>> 
>> 
>> 
>> 
>> -- 
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com
> 


Mime
View raw message