incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Wayne <wav...@gmail.com>
Subject Re: Node OOM Problems
Date Thu, 19 Aug 2010 20:13:14 GMT
We are using the random partitioner. The tokens we defined manually and data
is almost totally equal among nodes, 15GB per node when the trouble started.
System vitals look fine. CPU load is ~500% for java, iostats are low,
everything for all practical purposes looks "normal". Cfstats reports ~1-2
ms for a read and ~.015-.02 for a write. All cache is now turned off as that
seems to cause issues a lot faster. In terms of the current load, it is
single insert statements. 6 nodes and 4 processes pointed to each node. That
is not blasting away in my book.

We can go a lot higher with memory, but with an 8g heap an NO cache thought
this was a good number. With cache we would go higher on the memory but we
are trying to get past this issue before even using cache. The possibility
of a rogue large row is one I never considered. The largest row on the other
nodes is as much as 800megs. I can not get a cfstats reading on the bad node
so do not know how big its largest row is. I will raise memory to 16gb and
see if that makes a difference. I had though that the java heap sizes that
high had issues on their own in term of GC.

Thanks.


On Thu, Aug 19, 2010 at 9:44 PM, Edward Capriolo <edlinuxguru@gmail.com>wrote:

> On Thu, Aug 19, 2010 at 2:48 PM, Wayne <wav100@gmail.com> wrote:
> > I am having some serious problems keeping a 6 node cluster up and running
> > and stable under load. Any help would be greatly appreciated.
> >
> > Basically it always comes back to OOM errors that never seem to subside.
> > After 5 minutes or 3 hours of heavy load depending on settings one or
> more
> > nodes seem to go down with an OOM error. Upon restart the node tries to
> get
> > going but ends up with an OOM error again and again and again. This
> patterns
> > repeats and repeats with no way out. Once a node goes south on me
> deleting
> > everything and starting from scratch is my only option. I can never get
> it
> > to stay up again.
> >
> > I will try to describe the latest in detail. I am running a migration
> from a
> > MySQL database with 24 processes against 6 nodes in python using the std
> > thrift client. Should be a piece of cake in cassandra but no matter what
> I
> > have tried it always seems to kill a node after not too long.
> >
> > 6 x Nodes with 8 cores, 8G heap, 24g physical
> > 0.6.4
> > JVM Options for lower compaction priority
> > JVM 1.6.0_21-b06
> >
> > After 3 hours one node hard crashed with an OOM error. See below for
> > messages.
> >
> > I restart and then see nothing but MessageDeserializationTask & some
> > GCInspector messages.
> >
> >  INFO [GC inspection] 2010-08-19 17:06:09,228 GCInspector.java (line 143)
> > MEMTABLE-POST-FLUSHER             1         1
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 17:06:09,227
> > MessageDeserializationTask.java (line 47) dropping message (52,288ms past
> > timeout)
> >  INFO [GC inspection] 2010-08-19 17:06:09,229 GCInspector.java (line 143)
> > AE-SERVICE-STAGE                  0         0
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 17:06:09,229
> > MessageDeserializationTask.java (line 47) dropping message (52,290ms past
> > timeout)
> >  INFO [GC inspection] 2010-08-19 17:06:09,229 GCInspector.java (line 143)
> > FLUSH-WRITER-POOL                 1         1
> >
> >
> > After the 2nd restart it dies/freezes completely without actually
> crashing
> > (kill -9 is required). I knew it was dead because nothing was logged for
> 30
> > minutes to the log and nodetool still registered the node as down.
> >
> > After the 3rd reboot it comes back UP but then eventually crashes again.
> See
> > below for details.
> >
> > I have tried using key or row cache but cache makes the problem happen in
> 5
> > minutes as there is less memory and the OOM happens almost immediately. I
> > have played with setting less concurrent readers and writers and now have
> it
> > set to 16/16 (read/write). I have tried to set the mmap_index_only with
> no
> > change as the issue is really a JVM issue. All other settings are
> > std/default.
> >
> > Does anyone keep their nodes up under load over time? I have been working
> > with cassandra for a while now and still have yet to keep anything up
> under
> > load for very long. I know nothing about java, and frankly am starting to
> > wonder if I need to be a java programmer to use use cassandra/run in
> debug
> > mode. Any help would be greatly appreciated. These issues in the past
> made
> > me try to use hbase which was solid as a rock (comparatively) but SLOW
> and
> > too complicated...I came back again thinking .6.4 and soon .7 would be
> > better but it almost seems worse to me in terms of stability. Frankly
> next I
> > will have to look at other alternatives because cassandra is totally
> > unstable. I do not want to give up after all this time but I am out of
> > ideas.
> >
> > HELP!
> >
> > Messages prior to first crash:
> >
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 16:33:19,985
> > MessageDeserializationTask.java (line 47) dropping message (12,024ms past
> > timeout)
> >  INFO [GC inspection] 2010-08-19 16:33:19,985 GCInspector.java (line 143)
> > CONSISTENCY-MANAGER               0         0
> >  INFO [GC inspection] 2010-08-19 16:33:19,986 GCInspector.java (line 143)
> > LB-TARGET                         0         0
> >  INFO [GC inspection] 2010-08-19 16:33:19,986 GCInspector.java (line 143)
> > ROW-MUTATION-STAGE                1       167
> >  INFO [GC inspection] 2010-08-19 16:33:19,987 GCInspector.java (line 143)
> > MESSAGE-STREAMING-POOL            0         0
> >  INFO [GC inspection] 2010-08-19 16:33:19,987 GCInspector.java (line 143)
> > LOAD-BALANCER-STAGE               0         0
> >  INFO [GC inspection] 2010-08-19 16:33:19,987 GCInspector.java (line 143)
> > FLUSH-SORTER-POOL                 0         0
> >  INFO [GC inspection] 2010-08-19 16:33:19,987 GCInspector.java (line 143)
> > MEMTABLE-POST-FLUSHER             0         0
> >  INFO [GC inspection] 2010-08-19 16:33:19,987 GCInspector.java (line 143)
> > AE-SERVICE-STAGE                  0         0
> >  INFO [GC inspection] 2010-08-19 16:33:19,988 GCInspector.java (line 143)
> > FLUSH-WRITER-POOL                 0         0
> >  INFO [GC inspection] 2010-08-19 16:33:19,988 GCInspector.java (line 143)
> > HINTED-HANDOFF-POOL               1         1
> >  INFO [GC inspection] 2010-08-19 16:33:19,988 GCInspector.java (line 148)
> > CompactionManager               n/a         0
> >  INFO [GMFD:1] 2010-08-19 16:33:19,989 Gossiper.java (line 578)
> InetAddress
> > /10.4.0.202 is now UP
> >  INFO [GMFD:1] 2010-08-19 16:33:19,989 Gossiper.java (line 578)
> InetAddress
> > /10.4.0.203 is now UP
> >  INFO [GMFD:1] 2010-08-19 16:33:19,990 Gossiper.java (line 578)
> InetAddress
> > /10.4.0.201 is now UP
> >  INFO [GMFD:1] 2010-08-19 16:33:19,990 Gossiper.java (line 578)
> InetAddress
> > /10.4.0.204 is now UP
> >  INFO [GMFD:1] 2010-08-19 16:33:19,990 Gossiper.java (line 578)
> InetAddress
> > /10.4.0.205 is now UP
> >  INFO [ROW-MUTATION-STAGE:19] 2010-08-19 16:33:21,254
> ColumnFamilyStore.java
> > (line 357) Point.Time.Data.Value has reached its threshold; switching in
> a
> > fresh Memtable at
> >
> CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1282235556898.log',
> > position=15371920)
> >  INFO [ROW-MUTATION-STAGE:19] 2010-08-19 16:33:21,254
> ColumnFamilyStore.java
> > (line 609) Enqueuing flush of
> > Memtable-Point.Time.Data.Value@951707146(3462840 bytes, 314574
> operations)
> >  INFO [FLUSH-WRITER-POOL:1] 2010-08-19 16:33:21,254 Memtable.java (line
> 148)
> > Writing Memtable-Point.Time.Data.Value@951707146(3462840 bytes, 314574
> > operations)
> >  INFO [FLUSH-WRITER-POOL:1] 2010-08-19 16:33:22,090 Memtable.java (line
> 162)
> > Completed flushing
> > /mainraid/cassandra/data/Instance.4/Point.Time.Data.Value-362-Data.db
> >  INFO [ROW-MUTATION-STAGE:58] 2010-08-19 16:33:23,289
> ColumnFamilyStore.java
> > (line 357) Point.Data.Time.Value has reached its threshold; switching in
> a
> > fresh Memtable at
> >
> CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1282235556898.log',
> > position=25251401)
> >  INFO [ROW-MUTATION-STAGE:58] 2010-08-19 16:33:23,289
> ColumnFamilyStore.java
> > (line 609) Enqueuing flush of
> > Memtable-Point.Data.Time.Value@142023252(6330574 bytes, 314746
> operations)
> >  INFO [FLUSH-WRITER-POOL:1] 2010-08-19 16:33:23,289 Memtable.java (line
> 148)
> > Writing Memtable-Point.Data.Time.Value@142023252(6330574 bytes, 314746
> > operations)
> >  INFO [FLUSH-WRITER-POOL:1] 2010-08-19 16:33:23,785 Memtable.java (line
> 162)
> > Completed flushing
> > /mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1030-Data.db
> >  INFO [COMPACTION-POOL:1] 2010-08-19 16:33:23,787 CompactionManager.java
> > (line 246) Compacting
> >
> [org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1027-Data.db'),org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1028-Data.db'),org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1029-Data.db'),org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1030-Data.db')]
> >  INFO [GC inspection] 2010-08-19 16:34:03,565 GCInspector.java (line 116)
> GC
> > for ConcurrentMarkSweep: 39651 ms, 414341376 reclaimed leaving 8139886200
> > used; max is 8700035072
> >  INFO [GC inspection] 2010-08-19 16:34:03,566 GCInspector.java (line 137)
> > Pool Name                    Active   Pending
> >  INFO [GC inspection] 2010-08-19 16:34:03,566 GCInspector.java (line 143)
> > STREAM-STAGE                      0         0
> >  INFO [GC inspection] 2010-08-19 16:34:03,567 GCInspector.java (line 143)
> > FILEUTILS-DELETE-POOL             0         0
> >  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-19 16:34:03,567
> > SSTableDeletingReference.java (line 104) Deleted
> > /mainraid/cassandra/data/Instance.4/Point.Time.Data.Value-358-Data.db
> >  INFO [GC inspection] 2010-08-19 16:34:03,567 GCInspector.java (line 143)
> > RESPONSE-STAGE                    0         0
> >  INFO [GC inspection] 2010-08-19 16:34:03,584 GCInspector.java (line 143)
> > ROW-READ-STAGE                    0         0
> >  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-19 16:34:03,584
> > SSTableDeletingReference.java (line 104) Deleted
> > /mainraid/cassandra/data/system/HintsColumnFamily-7-Data.db
> >  INFO [GC inspection] 2010-08-19 16:34:03,584 GCInspector.java (line 143)
> > LB-OPERATIONS                     0         0
> >  INFO [GC inspection] 2010-08-19 16:34:03,585 GCInspector.java (line 143)
> > GMFD                              0         1
> >  INFO [GC inspection] 2010-08-19 16:34:03,585 GCInspector.java (line 143)
> > MESSAGE-DESERIALIZER-POOL         0         0
> >  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-19 16:34:03,585
> > SSTableDeletingReference.java (line 104) Deleted
> > /mainraid/cassandra/data/system/HintsColumnFamily-8-Data.db
> >  INFO [GC inspection] 2010-08-19 16:34:03,585 GCInspector.java (line 143)
> > CONSISTENCY-MANAGER               0         0
> >  INFO [GC inspection] 2010-08-19 16:34:03,586 GCInspector.java (line 143)
> > LB-TARGET                         0         0
> >  INFO [GC inspection] 2010-08-19 16:34:03,586 GCInspector.java (line 143)
> > ROW-MUTATION-STAGE                8        49
> >  INFO [GC inspection] 2010-08-19 16:34:03,587 GCInspector.java (line 143)
> > MESSAGE-STREAMING-POOL            0         0
> >  INFO [GC inspection] 2010-08-19 16:34:03,587 GCInspector.java (line 143)
> > LOAD-BALANCER-STAGE               0         0
> >  INFO [GC inspection] 2010-08-19 16:34:03,587 GCInspector.java (line 143)
> > FLUSH-SORTER-POOL                 0         0
> >  INFO [GC inspection] 2010-08-19 16:34:03,588 GCInspector.java (line 143)
> > MEMTABLE-POST-FLUSHER             1         1
> >  INFO [GC inspection] 2010-08-19 16:34:03,588 GCInspector.java (line 143)
> > AE-SERVICE-STAGE                  0         0
> >  INFO [GC inspection] 2010-08-19 16:34:03,588 GCInspector.java (line 143)
> > FLUSH-WRITER-POOL                 0         0
> >  INFO [GC inspection] 2010-08-19 16:34:03,588 GCInspector.java (line 143)
> > HINTED-HANDOFF-POOL               1         6
> >  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-19 16:34:03,589
> > SSTableDeletingReference.java (line 104) Deleted
> > /mainraid/cassandra/data/Instance.4/Point.Time.Data.Value-357-Data.db
> >  INFO [GC inspection] 2010-08-19 16:34:03,589 GCInspector.java (line 148)
> > CompactionManager               n/a         1
> >  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-19 16:34:03,600
> > SSTableDeletingReference.java (line 104) Deleted
> > /mainraid/cassandra/data/Instance.4/Point.Time.Data.Value-356-Data.db
> >  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-19 16:34:03,603
> > SSTableDeletingReference.java (line 104) Deleted
> > /mainraid/cassandra/data/Instance.4/Point.Time.Data.Value-359-Data.db
> >  INFO [COMPACTION-POOL:1] 2010-08-19 16:34:04,965 CompactionManager.java
> > (line 320) Compacted to
> > /mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1031-Data.db.
> > 41190813/36949030 bytes for 9048 keys.  Time: 1399ms.
> >  INFO [COMPACTION-POOL:1] 2010-08-19 16:34:04,965 CompactionManager.java
> > (line 246) Compacting
> >
> [org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1009-Data.db'),org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1017-Data.db'),org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1026-Data.db'),org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1031-Data.db')]
> >  INFO [GC inspection] 2010-08-19 16:34:46,656 GCInspector.java (line 116)
> GC
> > for ConcurrentMarkSweep: 41615 ms, 192522712 reclaimed leaving 8326856720
> > used; max is 8700035072
> >  INFO [GC inspection] 2010-08-19 16:34:46,657 GCInspector.java (line 137)
> > Pool Name                    Active   Pending
> >  INFO [GC inspection] 2010-08-19 16:34:46,657 GCInspector.java (line 143)
> > STREAM-STAGE                      0         0
> >  INFO [GC inspection] 2010-08-19 16:34:46,657 GCInspector.java (line 143)
> > FILEUTILS-DELETE-POOL             0         0
> >  INFO [GC inspection] 2010-08-19 16:34:46,658 GCInspector.java (line 143)
> > RESPONSE-STAGE                    0         0
> >  INFO [GC inspection] 2010-08-19 16:34:46,658 GCInspector.java (line 143)
> > ROW-READ-STAGE                    0         0
> >  INFO [GC inspection] 2010-08-19 16:34:46,658 GCInspector.java (line 143)
> > LB-OPERATIONS                     0         0
> >  INFO [GC inspection] 2010-08-19 16:34:46,659 GCInspector.java (line 143)
> > GMFD                              0         0
> >  INFO [GC inspection] 2010-08-19 16:34:46,659 GCInspector.java (line 143)
> > MESSAGE-DESERIALIZER-POOL         1         1
> >  INFO [GC inspection] 2010-08-19 16:34:46,659 GCInspector.java (line 143)
> > CONSISTENCY-MANAGER               0         0
> >  INFO [GC inspection] 2010-08-19 16:34:46,660 GCInspector.java (line 143)
> > LB-TARGET                         0         0
> >  INFO [GC inspection] 2010-08-19 16:34:46,660 GCInspector.java (line 143)
> > ROW-MUTATION-STAGE                4         1
> >  INFO [GC inspection] 2010-08-19 16:34:46,660 GCInspector.java (line 143)
> > MESSAGE-STREAMING-POOL            0         0
> >  INFO [GC inspection] 2010-08-19 16:34:46,661 GCInspector.java (line 143)
> > LOAD-BALANCER-STAGE               0         0
> >  INFO [GC inspection] 2010-08-19 16:34:46,661 GCInspector.java (line 143)
> > FLUSH-SORTER-POOL                 0         0
> >  INFO [GC inspection] 2010-08-19 16:34:46,661 GCInspector.java (line 143)
> > MEMTABLE-POST-FLUSHER             0         0
> >  INFO [GC inspection] 2010-08-19 16:34:46,662 GCInspector.java (line 143)
> > AE-SERVICE-STAGE                  0         0
> >  INFO [GC inspection] 2010-08-19 16:34:46,662 GCInspector.java (line 143)
> > FLUSH-WRITER-POOL                 0         0
> >  INFO [GC inspection] 2010-08-19 16:34:46,663 GCInspector.java (line 143)
> > HINTED-HANDOFF-POOL               1         6
> >  INFO [GC inspection] 2010-08-19 16:34:46,663 GCInspector.java (line 148)
> > CompactionManager               n/a         1
> >  INFO [Timer-0] 2010-08-19 16:36:00,785 Gossiper.java (line 180)
> InetAddress
> > /10.4.0.205 is now dead.
> >  INFO [Timer-0] 2010-08-19 16:36:00,786 Gossiper.java (line 180)
> InetAddress
> > /10.4.0.204 is now dead.
> >  INFO [GC inspection] 2010-08-19 16:36:00,786 GCInspector.java (line 116)
> GC
> > for ConcurrentMarkSweep: 37122 ms, 157488 reclaimed leaving 8342836376
> used;
> > max is 8700035072
> >  INFO [GC inspection] 2010-08-19 16:36:00,789 GCInspector.java (line 137)
> > Pool Name                    Active   Pending
> >  INFO [GC inspection] 2010-08-19 16:36:00,790 GCInspector.java (line 143)
> > STREAM-STAGE                      0         0
> >  INFO [GC inspection] 2010-08-19 16:36:00,790 GCInspector.java (line 143)
> > FILEUTILS-DELETE-POOL             0         0
> >  INFO [GC inspection] 2010-08-19 16:36:00,790 GCInspector.java (line 143)
> > RESPONSE-STAGE                    0         0
> >  INFO [GC inspection] 2010-08-19 16:36:00,790 GCInspector.java (line 143)
> > ROW-READ-STAGE                    0         0
> >  INFO [GC inspection] 2010-08-19 16:36:00,791 GCInspector.java (line 143)
> > LB-OPERATIONS                     0         0
> >  INFO [GC inspection] 2010-08-19 16:36:00,791 GCInspector.java (line 143)
> > GMFD                              0         0
> >  INFO [GC inspection] 2010-08-19 16:36:00,791 GCInspector.java (line 143)
> > MESSAGE-DESERIALIZER-POOL         0        82
> >  INFO [GC inspection] 2010-08-19 16:36:00,791 GCInspector.java (line 143)
> > CONSISTENCY-MANAGER               0         0
> >  INFO [GC inspection] 2010-08-19 16:36:00,792 GCInspector.java (line 143)
> > LB-TARGET                         0         0
> >  INFO [GC inspection] 2010-08-19 16:36:00,792 GCInspector.java (line 143)
> > ROW-MUTATION-STAGE                3       770
> >  INFO [GC inspection] 2010-08-19 16:36:00,792 GCInspector.java (line 143)
> > MESSAGE-STREAMING-POOL            0         0
> >  INFO [GC inspection] 2010-08-19 16:36:00,792 GCInspector.java (line 143)
> > LOAD-BALANCER-STAGE               0         0
> >  INFO [GC inspection] 2010-08-19 16:36:00,793 GCInspector.java (line 143)
> > FLUSH-SORTER-POOL                 0         0
> >  INFO [GC inspection] 2010-08-19 16:36:00,793 GCInspector.java (line 143)
> > MEMTABLE-POST-FLUSHER             0         0
> >  INFO [GC inspection] 2010-08-19 16:36:00,793 GCInspector.java (line 143)
> > AE-SERVICE-STAGE                  0         0
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 16:36:00,796
> > MessageDeserializationTask.java (line 47) dropping message (44,076ms past
> > timeout)
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 16:36:00,797
> > MessageDeserializationTask.java (line 47) dropping message (44,077ms past
> > timeout)
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 16:36:00,797
> > MessageDeserializationTask.java (line 47) dropping message (44,077ms past
> > timeout)
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 16:36:00,798
> > MessageDeserializationTask.java (line 47) dropping message (44,078ms past
> > timeout)
> >  INFO [Timer-0] 2010-08-19 16:36:00,797 Gossiper.java (line 180)
> InetAddress
> > /10.4.0.203 is now dead.
> >  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-19 16:36:00,797
> > SSTableDeletingReference.java (line 104) Deleted
> > /mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1030-Data.db
> >  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-19 16:36:00,805
> > SSTableDeletingReference.java (line 104) Deleted
> > /mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1028-Data.db
> >  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-19 16:36:00,808
> > SSTableDeletingReference.java (line 104) Deleted
> > /mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1027-Data.db
> >  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-19 16:36:00,812
> > SSTableDeletingReference.java (line 104) Deleted
> > /mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1029-Data.db
> >  INFO [GC inspection] 2010-08-19 16:36:00,833 GCInspector.java (line 143)
> > FLUSH-WRITER-POOL                 0         0
> >  INFO [GC inspection] 2010-08-19 16:36:00,834 GCInspector.java (line 143)
> > HINTED-HANDOFF-POOL               1         6
> >  INFO [GMFD:1] 2010-08-19 16:36:00,834 Gossiper.java (line 578)
> InetAddress
> > /10.4.0.204 is now UP
> >  INFO [GC inspection] 2010-08-19 16:36:00,834 GCInspector.java (line 148)
> > CompactionManager               n/a         1
> >  INFO [GMFD:1] 2010-08-19 16:36:00,839 Gossiper.java (line 578)
> InetAddress
> > /10.4.0.203 is now UP
> >  INFO [GMFD:1] 2010-08-19 16:36:00,839 Gossiper.java (line 578)
> InetAddress
> > /10.4.0.205 is now UP
> >  INFO [ROW-MUTATION-STAGE:34] 2010-08-19 16:36:00,898
> ColumnFamilyStore.java
> > (line 357) Point.Time.Data.Value has reached its threshold; switching in
> a
> > fresh Memtable at
> >
> CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1282235556898.log',
> > position=41642302)
> >  INFO [ROW-MUTATION-STAGE:34] 2010-08-19 16:36:00,898
> ColumnFamilyStore.java
> > (line 609) Enqueuing flush of
> > Memtable-Point.Time.Data.Value@543707511(4032380 bytes, 374579
> operations)
> >  INFO [FLUSH-WRITER-POOL:1] 2010-08-19 16:36:00,898 Memtable.java (line
> 148)
> > Writing Memtable-Point.Time.Data.Value@543707511(4032380 bytes, 374579
> > operations)
> > ERROR [Thread-121] 2010-08-19 16:36:00,911 CassandraDaemon.java (line 83)
> > Uncaught exception in thread Thread[Thread-121,5,main]
> > java.lang.OutOfMemoryError: Java heap space
> >     at
> >
> org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:71)
> >
> >
> >
> > Messages prior to last crash.
> >
> >
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:06:32,245
> > MessageDeserializationTask.java (line 47) dropping message (525,696ms
> past
> > timeout)
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:07:09,391
> > MessageDeserializationTask.java (line 47) dropping message (562,843ms
> past
> > timeout)
> >  INFO [GC inspection] 2010-08-19 18:07:09,391 GCInspector.java (line 143)
> > LB-OPERATIONS                     0         0
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:07:09,392
> > MessageDeserializationTask.java (line 47) dropping message (562,843ms
> past
> > timeout)
> >  INFO [GC inspection] 2010-08-19 18:07:09,392 GCInspector.java (line 143)
> > GMFD                              0         0
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:07:09,392
> > MessageDeserializationTask.java (line 47) dropping message (562,844ms
> past
> > timeout)
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:07:09,394
> > MessageDeserializationTask.java (line 47) dropping message (562,846ms
> past
> > timeout)
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:07:42,315
> > MessageDeserializationTask.java (line 47) dropping message (595,767ms
> past
> > timeout)
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:07:42,316
> > MessageDeserializationTask.java (line 47) dropping message (595,768ms
> past
> > timeout)
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:07:42,316
> > MessageDeserializationTask.java (line 47) dropping message (595,768ms
> past
> > timeout)
> >  INFO [GC inspection] 2010-08-19 18:07:41,965 GCInspector.java (line 143)
> > MESSAGE-DESERIALIZER-POOL         1    567231
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:08:18,189
> > MessageDeserializationTask.java (line 47) dropping message (631,641ms
> past
> > timeout)
> >  INFO [GC inspection] 2010-08-19 18:08:18,192 GCInspector.java (line 143)
> > CONSISTENCY-MANAGER               0         0
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:08:18,193
> > MessageDeserializationTask.java (line 47) dropping message (631,645ms
> past
> > timeout)
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:09:18,355
> > MessageDeserializationTask.java (line 47) dropping message (691,807ms
> past
> > timeout)
> >  INFO [GC inspection] 2010-08-19 18:09:18,356 GCInspector.java (line 143)
> > LB-TARGET                         0         0
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:09:44,730
> > MessageDeserializationTask.java (line 47) dropping message (691,808ms
> past
> > timeout)
> > ERROR [COMPACTION-POOL:1] 2010-08-19 18:18:42,658 CassandraDaemon.java
> (line
> > 83) Uncaught exception in thread Thread[COMPACTION-POOL:1,1,main]
> > java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java
> > heap space
> >     at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
> >     at java.util.concurrent.FutureTask.get(FutureTask.java:83)
> >     at
> >
> org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.afterExecute(DebuggableThreadPoolExecutor.java:86)
> >     at
> >
> org.apache.cassandra.db.CompactionManager$CompactionExecutor.afterExecute(CompactionManager.java:577)
> >     at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:888)
> >     at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >     at java.lang.Thread.run(Thread.java:619)
> > Caused by: java.lang.OutOfMemoryError: Java heap space
> >     at
> >
> org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:84)
> >     at
> >
> org.apache.cassandra.db.SuperColumnSerializer.deserialize(SuperColumn.java:335)
> >     at
> >
> org.apache.cassandra.db.SuperColumnSerializer.deserialize(SuperColumn.java:284)
> >     at
> >
> org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:117)
> >     at
> >
> org.apache.cassandra.db.ColumnFamilySerializer.deserializeFromSSTable(ColumnFamilySerializer.java:160)
> >     at
> >
> org.apache.cassandra.io.IteratingRow.getColumnFamily(IteratingRow.java:79)
> >     at
> >
> org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:112)
> >     at
> >
> org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:41)
> >     at
> >
> org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
> >     at
> >
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
> >     at
> >
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
> >     at
> >
> org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
> >     at
> >
> org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
> >     at
> >
> org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:295)
> >     at
> >
> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:102)
> >     at
> >
> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:83)
> >     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >     at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >     ... 2 more
> > ERROR [Thread-13] 2010-08-19 18:18:42,658 CassandraDaemon.java (line 83)
> > Uncaught exception in thread Thread[Thread-13,5,main]
> > java.lang.OutOfMemoryError: Java heap space
> >     at java.util.concurrent.FutureTask.<init>(FutureTask.java:64)
> >     at
> >
> java.util.concurrent.AbstractExecutorService.newTaskFor(AbstractExecutorService.java:58)
> >     at
> >
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:77)
> >     at
> >
> org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:73)
> >  INFO [GC inspection] 2010-08-19 18:18:42,658 GCInspector.java (line 143)
> > ROW-MUTATION-STAGE                3         3
> >  INFO [GC inspection] 2010-08-19 18:18:42,666 GCInspector.java (line 143)
> > MESSAGE-STREAMING-POOL            0         0
> >  INFO [COMPACTION-POOL:1] 2010-08-19 18:18:42,659 CompactionManager.java
> > (line 246) Compacting
> >
> [org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1071-Data.db'),org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1072-Data.db'),org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1073-Data.db'),org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1074-Data.db'),org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1075-Data.db'),org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1076-Data.db'),org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1077-Data.db'),org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1078-Data.db')]
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:18:42,658
> > MessageDeserializationTask.java (line 47) dropping message (1,256,110ms
> past
> > timeout)
> >  INFO [GC inspection] 2010-08-19 18:18:42,666 GCInspector.java (line 143)
> > LOAD-BALANCER-STAGE               0         0
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:18:42,667
> > MessageDeserializationTask.java (line 47) dropping message (1,256,119ms
> past
> > timeout)
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:18:42,960
> > MessageDeserializationTask.java (line 47) dropping message (1,256,412ms
> past
> > timeout)
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:18:42,960
> > MessageDeserializationTask.java (line 47) dropping message (1,256,412ms
> past
> > timeout)
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:18:42,960
> > MessageDeserializationTask.java (line 47) dropping message (1,256,412ms
> past
> > timeout)
> >  INFO [GC inspection] 2010-08-19 18:18:42,960 GCInspector.java (line 143)
> > FLUSH-SORTER-POOL                 0         0
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:18:42,960
> > MessageDeserializationTask.java (line 47) dropping message (1,256,412ms
> past
> > timeout)
> >  INFO [GC inspection] 2010-08-19 18:18:42,962 GCInspector.java (line 143)
> > MEMTABLE-POST-FLUSHER             0         0
> >  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:18:42,963
> > MessageDeserializationTask.java (line 47) dropping message (1,256,415ms
> past
> > timeout)
> >  INFO [GC inspection] 2010-08-19 18:18:42,965 GCInspector.java (line 143)
> > AE-SERVICE-STAGE                  0         0
> >  INFO [GC inspection] 2010-08-19 18:18:42,968 GCInspector.java (line 143)
> > FLUSH-WRITER-POOL                 0         0
> >
> >
> >
> >
>
> First,
> I have some general questions. Are you using random partitioner? Are
> your tokens calculated properly to spread data around the ring or did
> cassandra auto-choose tokens? What to your system vitals look like?
> What is the disk statistics are the disks saturated? Is the system in
> IO wait? top, iostat,sar. What does CFSTATS say? What is your
> largestRow? What is your average row? What is your cache hit rate?
> What is your replication factor?  Exactly how many rows are you
> inserting a second? If you throttle your load while tuning up your
> settings that is more scientific then firing full blast bulk loads
> that causes crashes.
>
> I notice you said this... 6 x Nodes with 8 cores, 8G heap, 24g physical.
> I would go higher with the heap 8 cores, 8G heap, 24g physical. If you
> are hitting OOM with 8G go higher. If you are going to use row cache
> you should probably bring this really high.
>
> You are getting OOM. I notice you tried rowcaching and keycaching. How
> did you set this to fixed size or % ? You should almost always set it
> to fixed size.
>
> Here is something profound that I learned:
> Larger row cache can SAVE you memory.
> What huh? How can that be? more memory uses less memory?
> Having a small row/key cache caused higher disk utilization. If all
> requests are going to disk request take longer using more memory for
> longer. BUT if you get that request served off Row Cache. Pow Zip Bang
> the request is in and out fast, and you never get into using those
> longer lived java memory pools.
> Then again if your workload is ALL write, caches will not help you with
> that.
>
> Set this low:
>  <RowWarningThresholdInMB>10</RowWarningThresholdInMB>
>  I found that rows with millions of columns would cause havoc for me.
> This helped me find some bad rows.
>
> How much data is on each node 'nodetool ring' output
>

Mime
View raw message