incubator-cassandra-user mailing list archives

From Chris Goffinet <goffi...@digg.com>
Subject Re: Cassandra cluster runs into OOM when bulk loading data
Date Tue, 27 Apr 2010 02:53:46 GMT
I'll work on doing more tests around this. In 0.5 we used a different data structure that required
polling. But this does seem problematic. 

-Chris

On Apr 26, 2010, at 7:04 PM, Eric Yu wrote:

> I have the same problem here. I analyzed the hprof file with MAT and, as you said,
> LinkedBlockingQueue used 2.6GB.
> I think Cassandra's thread pools should limit their queue sizes.
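As an aside, a minimal sketch of the bounded-queue idea (this is not Cassandra's actual executor code; the pool size of 8 and the queue capacity of 1024 are made-up values): a ThreadPoolExecutor with a fixed-capacity work queue applies backpressure to producers instead of letting pending tasks grow without bound.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class BoundedPoolSketch {
        public static void main(String[] args) {
            // Bounded work queue: at most 1024 pending tasks (hypothetical capacity).
            ArrayBlockingQueue<Runnable> queue = new ArrayBlockingQueue<Runnable>(1024);

            ThreadPoolExecutor pool = new ThreadPoolExecutor(
                    8, 8,                      // core and max threads (hypothetical)
                    60, TimeUnit.SECONDS,
                    queue,
                    // When the queue is full, run the task in the submitting thread,
                    // which slows the producer down instead of growing memory.
                    new ThreadPoolExecutor.CallerRunsPolicy());

            for (int i = 0; i < 100000; i++) {
                pool.execute(new Runnable() {
                    public void run() { /* process one message */ }
                });
            }
            pool.shutdown();
        }
    }
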
> 
> cassandra 0.6.1
> 
> java version
> $ java -version
> java version "1.6.0_20"
> Java(TM) SE Runtime Environment (build 1.6.0_20-b02)
> Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01, mixed mode)
> 
> iostat
> $ iostat -x -l 1
> Device:         rrqm/s   wrqm/s   r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda              81.00  8175.00 224.00 17.00 23984.00  2728.00   221.68     1.01    1.86   0.76  18.20
> 
> tpstats (of course, this node is still alive)
> $ ./nodetool -host localhost tpstats  
> Pool Name                    Active   Pending      Completed
> FILEUTILS-DELETE-POOL             0         0           1281
> STREAM-STAGE                      0         0              0
> RESPONSE-STAGE                    0         0      473617241
> ROW-READ-STAGE                    0         0              0
> LB-OPERATIONS                     0         0              0
> MESSAGE-DESERIALIZER-POOL         0         0      718355184
> GMFD                              0         0         132509
> LB-TARGET                         0         0              0
> CONSISTENCY-MANAGER               0         0              0
> ROW-MUTATION-STAGE                0         0      293735704
> MESSAGE-STREAMING-POOL            0         0              6
> LOAD-BALANCER-STAGE               0         0              0
> FLUSH-SORTER-POOL                 0         0              0
> MEMTABLE-POST-FLUSHER             0         0           1870
> FLUSH-WRITER-POOL                 0         0           1870
> AE-SERVICE-STAGE                  0         0              5
> HINTED-HANDOFF-POOL               0         0             21
> 
> 
> On Tue, Apr 27, 2010 at 3:32 AM, Chris Goffinet <goffinet@digg.com> wrote:
> Upgrade to b20 of Sun's JVM. This OOM might be related to LinkedBlockingQueue
> issues that were fixed.
> 
> -Chris
> 
> 
> 2010/4/26 Roland Hänel <roland@haenel.me>
> Cassandra Version 0.6.1
> OpenJDK Server VM (build 14.0-b16, mixed mode)
> Import speed is about 10MB/s for the full cluster; if a compaction is going on, the
> individual node is I/O limited.
> tpstats: you caught me, I didn't know about this. I will set up a test and try to catch a
> node during the critical time.
> 
> Thanks,
> Roland
> 
> 
> 2010/4/26 Chris Goffinet <goffinet@digg.com>
> 
> Which version of Cassandra?
> Which version of Java JVM are you using?
> What do your I/O stats look like when bulk importing?
> When you run `nodeprobe -host XXXX tpstats`, is any thread pool backing up during the
> import?
> 
> -Chris
> 
> 
> 2010/4/26 Roland Hänel <roland@haenel.me>
> 
> I have a cluster of 5 machines building a Cassandra datastore, and I load bulk data into
> this using the Java Thrift API. The first ~250GB runs fine; then one of the nodes starts
> to throw OutOfMemory exceptions. I'm not using any row or index caches, and since I only
> have 5 CFs and about 2.5 GB of RAM allocated to the JVM (-Xmx2500M), in theory that
> shouldn't happen. All inserts are done with consistency level ALL.
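For context on what a bulk insert over the Java Thrift API at consistency level ALL looks like, a minimal sketch against the 0.6-era Thrift interface; the host, port, keyspace, column family, and row/column names are invented placeholders, and the method signatures are given from memory, so check them against the generated client.

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.ColumnPath;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    public class BulkInsertSketch {
        public static void main(String[] args) throws Exception {
            TTransport transport = new TSocket("node1", 9160);   // placeholder host/port
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
            transport.open();

            // "MyKeyspace" and "MyCF" are placeholders, not names from this thread.
            ColumnPath path = new ColumnPath("MyCF");
            path.setColumn("payload".getBytes("UTF-8"));

            for (int i = 0; i < 1000; i++) {
                client.insert("MyKeyspace", "row-" + i, path,
                              ("value-" + i).getBytes("UTF-8"),
                              System.currentTimeMillis(),
                              ConsistencyLevel.ALL);   // wait for every replica to ack
            }
            transport.close();
        }
    }
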
> 
> I hope with this I have avoided all the 'usual dummy errors' that lead to OOMs. I have
> begun to troubleshoot the issue with JMX; however, it's difficult to catch the JVM at the
> right moment because it runs well for several hours before this happens.
> 
> One thing comes to mind; maybe one of the experts could confirm or reject this idea for
> me: is it possible that when one machine slows down a little (for example because a big
> compaction is going on), the memtables don't get flushed to disk as fast as they are
> building up under the continuing bulk import? That would result in a downward spiral: the
> system gets slower and slower on disk I/O, but since more and more data keeps arriving
> over Thrift, it finally runs out of memory.
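One way to keep such a spiral from overwhelming the server is to throttle the importer on the client side. A minimal sketch (not from this thread; the limit of 64 in-flight writes and the pool size of 8 are arbitrary placeholders) that blocks the producing thread whenever too many writes are outstanding:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Semaphore;

    public class ThrottledImporter {
        // At most 64 writes in flight at once (placeholder value).
        private static final Semaphore inFlight = new Semaphore(64);
        private static final ExecutorService workers = Executors.newFixedThreadPool(8);

        public static void submitWrite(final Runnable write) throws InterruptedException {
            inFlight.acquire();              // blocks the producer when the cluster falls behind
            workers.execute(new Runnable() {
                public void run() {
                    try {
                        write.run();         // e.g. one Thrift insert
                    } finally {
                        inFlight.release();  // free a slot once the write has been acked
                    }
                }
            });
        }
    }
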
> 
> I'm using the "periodic" commit log sync; maybe this could also create a situation where
> the commit log writer is too slow to keep up with the data intake, resulting in
> ever-growing memory usage?
> 
> Maybe these thoughts are just bullshit. Let me know if so... ;-)
> 

