Subject: Re: Overwhelming a cluster with writes?
From: Ilya Maykov <ivmaykov@gmail.com>
To: user@cassandra.apache.org
Date: Mon, 5 Apr 2010 23:48:47 -0700

No, the disks on all nodes have about 750GB of free space. Also, as mentioned in my follow-up email, writing with ConsistencyLevel.ALL makes the slowdowns / crashes go away.

-- Ilya

On Mon, Apr 5, 2010 at 11:46 PM, Ran Tavory wrote:
> Do you see one of the disks used by Cassandra filled up when a node crashes?
>
> On Tue, Apr 6, 2010 at 9:39 AM, Ilya Maykov wrote:
>>
>> I'm running the nodes with a JVM heap size of 6GB, and here are the
>> related options from my storage-conf.xml. As mentioned in the first
>> email, I left everything at the default values. I briefly googled
>> around for "Cassandra performance tuning" etc. but haven't found a
>> definitive guide ... any help with tuning these parameters is greatly
>> appreciated!
>>
>> auto
>> 512
>> 64
>> 32
>> 8
>> 64
>> 64
>> 256
>> 0.3
>> 60
>> 8
>> 64
>> periodic
>> 10000
>> 864000
>>
>> -- Ilya
>>
>> On Mon, Apr 5, 2010 at 11:26 PM, Boris Shulman wrote:
>> > You are running out of memory on your nodes. Before the final crash
>> > your nodes are probably slow due to GC. What is your memtable size?
>> > What cache options did you configure?
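Boris's question about memtable size is the crux here. As rough arithmetic of my own (not from the thread), assuming the 0.6 default MemtableThroughputInMB of 64, the ~50 MB/s client rate reported later in the thread, 3x replication, and an even spread over 4 nodes, each node fills a memtable every couple of seconds, so flushes and compactions pile up fast:

```python
# Back-of-the-envelope (mine): how quickly a node's memtable fills
# under this workload. MEMTABLE_MB is an assumed 0.6 default; the
# client rate and cluster shape come from the thread.

CLIENT_MB_PER_S = 50      # observed client network throughput
REPLICATION = 3           # replication factor used in the test
NODES = 4
MEMTABLE_MB = 64          # assumed MemtableThroughputInMB default

# Each node receives roughly client_rate * RF / node_count.
per_node_mb_per_s = CLIENT_MB_PER_S * REPLICATION / NODES
seconds_to_fill = MEMTABLE_MB / per_node_mb_per_s
print(f"~{per_node_mb_per_s:.1f} MB/s per node, "
      f"memtable full every ~{seconds_to_fill:.1f}s")
```

At roughly one memtable flush per node every two seconds, a single slow disk holding both commitlog and data (as in this setup) has little chance to keep up.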
>> >
>> > On Tue, Apr 6, 2010 at 7:31 AM, Ilya Maykov wrote:
>> >> Hi all,
>> >>
>> >> I've just started experimenting with Cassandra to get a feel for the
>> >> system. I've set up a test cluster, and to get a ballpark idea of its
>> >> performance I wrote a simple tool to load some toy data into it.
>> >> Surprisingly, I am able to "overwhelm" my 4-node cluster with
>> >> writes from a single client. I'm trying to figure out whether this is a
>> >> problem with my setup, whether I'm hitting bugs in the Cassandra codebase,
>> >> or whether this is intended behavior. Sorry this email is kind of long;
>> >> here is the TL;DR version:
>> >>
>> >> While writing to Cassandra from a single client, I am able to get the
>> >> cluster into a bad state where nodes randomly disconnect from
>> >> each other, write performance plummets, and sometimes nodes even
>> >> crash. Further, the nodes do not recover as long as the writes
>> >> continue (even at a much lower rate), and sometimes do not recover at
>> >> all unless I restart them. I can trigger this simply by throwing
>> >> data at the cluster fast enough, and I'm wondering whether this is a
>> >> known issue or whether I need to tweak my setup.
>> >>
>> >> Now, the details.
>> >>
>> >> First, a little bit about the setup:
>> >>
>> >> 4-node cluster of identical machines, running cassandra-0.6.0-rc1 with
>> >> the fixes for CASSANDRA-933, CASSANDRA-934, and CASSANDRA-936 patched
>> >> in. Node specs:
>> >> 8-core Intel Xeon E5405 @ 2.00GHz
>> >> 8GB RAM
>> >> 1Gbit ethernet
>> >> Red Hat Linux 2.6.18
>> >> JVM 1.6.0_19, 64-bit
>> >> 1TB spinning disk housing both the commitlog and data directories
>> >> (which I know is not ideal)
>> >> The client machine is on the same local network and has very similar
>> >> specs.
>> >>
>> >> The Cassandra nodes are started with the following JVM options:
>> >>
>> >> ./cassandra JVM_OPTS="-Xms6144m -Xmx6144m -XX:+UseConcMarkSweepGC -d64
>> >> -XX:NewSize=1024m -XX:MaxNewSize=1024m -XX:+DisableExplicitGC"
>> >>
>> >> I'm using the default settings for all of the tunable stuff at the
>> >> bottom of storage-conf.xml. I also selected my initial tokens to evenly
>> >> partition the key space when the cluster was bootstrapped. I am using
>> >> the RandomPartitioner.
>> >>
>> >> Now, about the test. Basically I am trying to get an idea of just how
>> >> fast I can make this thing go. I am writing ~250M data records into
>> >> the cluster, replicated 3x, using Ran Tavory's Hector client
>> >> (Java), writing with ConsistencyLevel.ZERO and
>> >> FailoverPolicy.FAIL_FAST. The client is using 32 threads, with 8
>> >> threads talking to each of the 4 nodes in the cluster. Records are
>> >> identified by a numeric id, and I'm writing them in batches of up to
>> >> 10k records per row, with each record in its own column. The row key
>> >> identifies the bucket into which records fall. So, records with ids
>> >> 0 - 9999 are written to row "0", 10000 - 19999 are written to row
>> >> "10000", etc. Each record is a JSON object with ~10-20 fields.
>> >>
>> >> Records: {  // Column Family
>> >>   0 : {  // row key for the start of the bucket. Buckets span a range
>> >>          // of up to 10000 records
>> >>     1 : "{ /* some JSON */ }",  // Column for record with id=1
>> >>     3 : "{ /* some more JSON */ }",  // Column for record with id=3
>> >>     ...
>> >>     9999 : "{ /* ... */ }"
>> >>   },
>> >>   10000 : {  // row key for the start of the next bucket
>> >>     10001 : ...
>> >>     10004 : ...
>> >>   }
>> >> }
>> >>
>> >> I am reading the data out of a local, sorted file on the client, so I
>> >> only write a row to Cassandra once all records for that row have been
>> >> read, and each row is written to exactly once.
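The bucketing described above amounts to a simple id-to-row-key mapping; a Python paraphrase (the function name is mine, the thread only describes the scheme):

```python
# Records are grouped into rows ("buckets") of up to 10,000 records;
# the row key is the first id of the bucket, as a string.

BUCKET_SIZE = 10_000

def bucket_row_key(record_id: int) -> str:
    """Row key of the bucket holding record_id.
    Ids 0-9999 -> "0", 10000-19999 -> "10000", and so on."""
    return str((record_id // BUCKET_SIZE) * BUCKET_SIZE)

print(bucket_row_key(3))      # "0"
print(bucket_row_key(10001))  # "10000"
```

Each record then becomes one column in its bucket row, with the record id as the column name and the JSON blob as the value.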
I'm using a
>> >> producer-consumer queue to pump data from the input reader thread to
>> >> the output writer threads. I found that I have to throttle the reader
>> >> thread heavily in order to get good behavior. So, if I make the reader
>> >> sleep for 7 seconds every 1M records, everything is fine - the data
>> >> loads in about an hour, half of which is spent by the reader thread
>> >> sleeping. In between the sleeps, I see ~40-50 MB/s throughput on the
>> >> client's network interface, and it takes ~7-8 seconds to write each
>> >> batch of 1M records.
>> >>
>> >> Now, if I remove the 7-second sleeps on the client side, things get
>> >> bad after the first ~8M records are written to the cluster. Write
>> >> throughput drops to <5 MB/s. I start seeing messages about nodes
>> >> disconnecting and reconnecting in Cassandra's system.log, as well as
>> >> lots of GC messages:
>> >>
>> >> ...
>> >>  INFO [Timer-1] 2010-04-06 04:03:27,178 Gossiper.java (line 179)
>> >> InetAddress /10.15.38.88 is now dead.
>> >>  INFO [GC inspection] 2010-04-06 04:03:30,259 GCInspector.java (line
>> >> 110) GC for ConcurrentMarkSweep: 2989 ms, 55326320 reclaimed leaving
>> >> 1035998648 used; max is 1211170816
>> >>  INFO [GC inspection] 2010-04-06 04:03:41,838 GCInspector.java (line
>> >> 110) GC for ConcurrentMarkSweep: 3004 ms, 24377240 reclaimed leaving
>> >> 1066120952 used; max is 1211170816
>> >>  INFO [Timer-1] 2010-04-06 04:03:44,136 Gossiper.java (line 179)
>> >> InetAddress /10.15.38.55 is now dead.
>> >>  INFO [GMFD:1] 2010-04-06 04:03:44,138 Gossiper.java (line 568)
>> >> InetAddress /10.15.38.55 is now UP
>> >>  INFO [GC inspection] 2010-04-06 04:03:52,957 GCInspector.java (line
>> >> 110) GC for ConcurrentMarkSweep: 2319 ms, 4504888 reclaimed leaving
>> >> 1086023832 used; max is 1211170816
>> >>  INFO [Timer-1] 2010-04-06 04:04:19,508 Gossiper.java (line 179)
>> >> InetAddress /10.15.38.242 is now dead.
>> >>  INFO [Timer-1] 2010-04-06 04:05:03,039 Gossiper.java (line 179)
>> >> InetAddress /10.15.38.55 is now dead.
>> >>  INFO [GMFD:1] 2010-04-06 04:05:03,041 Gossiper.java (line 568)
>> >> InetAddress /10.15.38.55 is now UP
>> >>  INFO [GC inspection] 2010-04-06 04:05:08,539 GCInspector.java (line
>> >> 110) GC for ConcurrentMarkSweep: 2375 ms, 39534920 reclaimed leaving
>> >> 1051620856 used; max is 1211170816
>> >> ...
>> >>
>> >> Finally followed by this, and some/all nodes going down:
>> >>
>> >> ERROR [COMPACTION-POOL:1] 2010-04-06 04:05:05,475
>> >> DebuggableThreadPoolExecutor.java (line 94) Error in executor
>> >> futuretask
>> >> java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError:
>> >> Java heap space
>> >>     at java.util.concurrent.FutureTask$Sync.innerGet(Unknown Source)
>> >>     at java.util.concurrent.FutureTask.get(Unknown Source)
>> >>     at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.afterExecute(DebuggableThreadPoolExecutor.java:86)
>> >>     at org.apache.cassandra.db.CompactionManager$CompactionExecutor.afterExecute(CompactionManager.java:582)
>> >>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>> >>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>> >>     at java.lang.Thread.run(Unknown Source)
>> >> Caused by: java.lang.OutOfMemoryError: Java heap space
>> >>     at java.util.Arrays.copyOf(Unknown Source)
>> >>     at java.io.ByteArrayOutputStream.write(Unknown Source)
>> >>     at java.io.DataOutputStream.write(Unknown Source)
>> >>     at org.apache.cassandra.io.IteratingRow.echoData(IteratingRow.java:69)
>> >>     at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:138)
>> >>     at
org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:1)
>> >>     at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
>> >>     at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
>> >>     at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
>> >>     at org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
>> >>     at org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
>> >>     at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:299)
>> >>     at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:102)
>> >>     at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:1)
>> >>     at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>> >>     at java.util.concurrent.FutureTask.run(Unknown Source)
>> >>     ... 3 more
>> >>
>> >> At first I thought that with ConsistencyLevel.ZERO I must be doing
>> >> async writes, so Cassandra can't push back on the client threads (by
>> >> blocking them), and thus the server is getting overwhelmed. But I
>> >> would expect it to start dropping data rather than crash in that case
>> >> (after all, I did say ZERO, so I can't expect any reliability,
>> >> right?). However, I see similar slowdown / node-dropout behavior when
>> >> I set the consistency level to ONE. Does Cassandra push back on
>> >> writers under heavy load? Is there some magic setting I need to tune
>> >> to keep it from falling over? Do I just need a bigger cluster? Thanks
>> >> in advance,
>> >>
>> >> -- Ilya
>> >>
>> >> P.S.
I realize that it's still handling a LOT of data for just 4
>> >> nodes, and in practice nobody would run a system that gets 125k
>> >> writes per second on top of a 4-node cluster. I was just surprised
>> >> that I could make Cassandra fall over at all using a single client
>> >> pumping data at 40-50 MB/s.
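The 125k writes/s figure is consistent with the batch timings quoted earlier in the thread; a quick sanity check (my own arithmetic):

```python
# Sanity check of the quoted rates: 1M-record batches take ~7-8s to
# write, plus a 7s sleep per batch, for ~250M records total. This
# should reproduce "about an hour, half of it sleeping" and the
# ~125k writes/s burst rate.

records_per_batch = 1_000_000
seconds_per_batch = 8          # upper end of the observed 7-8s range
sleep_per_batch = 7
total_records = 250_000_000

burst_rate = records_per_batch / seconds_per_batch
writing = total_records / burst_rate
sleeping = (total_records // records_per_batch) * sleep_per_batch

print(f"{burst_rate:,.0f} records/s during bursts")  # 125,000
print(f"~{(writing + sleeping) / 60:.1f} min total, "
      f"{sleeping / (writing + sleeping):.0%} of it sleeping")
```

Note that with 3x replication, the cluster is internally absorbing roughly three times that record rate, i.e. on the order of 375k column writes per second across 4 nodes.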