Subject: Re: Overwhelming a cluster with writes?
From: Ilya Maykov <ivmaykov@gmail.com>
To: user@cassandra.apache.org
Date: Mon, 5 Apr 2010 23:48:47 -0700

No, the disks on all nodes have about 750GB of free space. Also, as mentioned in my follow-up email, writing with ConsistencyLevel.ALL makes the slowdowns / crashes go away.

-- Ilya

On Mon, Apr 5, 2010 at 11:46 PM, Ran Tavory wrote:
> Do you see one of the disks used by Cassandra filled up when a node crashes?
>
> On Tue, Apr 6, 2010 at 9:39 AM, Ilya Maykov wrote:
>>
>> I'm running the nodes with a JVM heap size of 6GB, and here are the
>> related options from my storage-conf.xml. As mentioned in the first
>> email, I left everything at the default values. I briefly googled
>> around for "Cassandra performance tuning" etc. but haven't found a
>> definitive guide ... any help with tuning these parameters is greatly
>> appreciated!
>>
>> auto
>> 512
>> 64
>> 32
>> 8
>> 64
>> 64
>> 256
>> 0.3
>> 60
>> 8
>> 64
>> periodic
>> 10000
>> 864000
>>
>> -- Ilya
>>
>> On Mon, Apr 5, 2010 at 11:26 PM, Boris Shulman wrote:
>> > You are running out of memory on your nodes. Before the final crash
>> > your nodes are probably slow due to GC. What is your memtable size?
>> > What cache options did you configure?
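Boris's question about memtable size is the crux here. As rough arithmetic of my own (not from the thread), assuming the 0.6 default MemtableThroughputInMB of 64, the ~50 MB/s client rate reported later in the thread, 3x replication, and an even spread over 4 nodes, each node fills a memtable every couple of seconds, so flushes and compactions pile up fast:

```python
# Back-of-the-envelope (mine): how quickly a node's memtable fills
# under this workload. MEMTABLE_MB is an assumed 0.6 default; the
# client rate and cluster shape come from the thread.

CLIENT_MB_PER_S = 50      # observed client network throughput
REPLICATION = 3           # replication factor used in the test
NODES = 4
MEMTABLE_MB = 64          # assumed MemtableThroughputInMB default

# Each node receives roughly client_rate * RF / node_count.
per_node_mb_per_s = CLIENT_MB_PER_S * REPLICATION / NODES
seconds_to_fill = MEMTABLE_MB / per_node_mb_per_s
print(f"~{per_node_mb_per_s:.1f} MB/s per node, "
      f"memtable full every ~{seconds_to_fill:.1f}s")
```

At roughly one memtable flush per node every two seconds, a single slow disk holding both commitlog and data (as in this setup) has little chance to keep up.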
>> >
>> > On Tue, Apr 6, 2010 at 7:31 AM, Ilya Maykov wrote:
>> >> Hi all,
>> >>
>> >> I've just started experimenting with Cassandra to get a feel for the
>> >> system. I've set up a test cluster, and to get a ballpark idea of its
>> >> performance I wrote a simple tool to load some toy data into it.
>> >> Surprisingly, I am able to "overwhelm" my 4-node cluster with
>> >> writes from a single client. I'm trying to figure out whether this is a
>> >> problem with my setup, whether I'm hitting bugs in the Cassandra codebase,
>> >> or whether this is intended behavior. Sorry this email is kind of long;
>> >> here is the TL;DR version:
>> >>
>> >> While writing to Cassandra from a single client, I am able to get the
>> >> cluster into a bad state where nodes randomly disconnect from
>> >> each other, write performance plummets, and sometimes nodes even
>> >> crash. Further, the nodes do not recover as long as the writes
>> >> continue (even at a much lower rate), and sometimes do not recover at
>> >> all unless I restart them. I can trigger this simply by throwing
>> >> data at the cluster fast enough, and I'm wondering whether this is a
>> >> known issue or whether I need to tweak my setup.
>> >>
>> >> Now, the details.
>> >>
>> >> First, a little bit about the setup:
>> >>
>> >> 4-node cluster of identical machines, running cassandra-0.6.0-rc1 with
>> >> the fixes for CASSANDRA-933, CASSANDRA-934, and CASSANDRA-936 patched
>> >> in. Node specs:
>> >> 8-core Intel Xeon E5405 @ 2.00GHz
>> >> 8GB RAM
>> >> 1Gbit ethernet
>> >> Red Hat Linux 2.6.18
>> >> JVM 1.6.0_19, 64-bit
>> >> 1TB spinning disk housing both the commitlog and data directories
>> >> (which I know is not ideal)
>> >> The client machine is on the same local network and has very similar
>> >> specs.
>> >>
>> >> The Cassandra nodes are started with the following JVM options:
>> >>
>> >> ./cassandra JVM_OPTS="-Xms6144m -Xmx6144m -XX:+UseConcMarkSweepGC -d64
>> >> -XX:NewSize=1024m -XX:MaxNewSize=1024m -XX:+DisableExplicitGC"
>> >>
>> >> I'm using the default settings for all of the tunable stuff at the
>> >> bottom of storage-conf.xml. I also selected my initial tokens to evenly
>> >> partition the key space when the cluster was bootstrapped. I am using
>> >> the RandomPartitioner.
>> >>
>> >> Now, about the test. Basically I am trying to get an idea of just how
>> >> fast I can make this thing go. I am writing ~250M data records into
>> >> the cluster, replicated 3x, using Ran Tavory's Hector client
>> >> (Java), writing with ConsistencyLevel.ZERO and
>> >> FailoverPolicy.FAIL_FAST. The client is using 32 threads, with 8
>> >> threads talking to each of the 4 nodes in the cluster. Records are
>> >> identified by a numeric id, and I'm writing them in batches of up to
>> >> 10k records per row, with each record in its own column. The row key
>> >> identifies the bucket into which records fall. So, records with ids
>> >> 0 - 9999 are written to row "0", 10000 - 19999 are written to row
>> >> "10000", etc. Each record is a JSON object with ~10-20 fields.
>> >>
>> >> Records: {  // Column Family
>> >>   0 : {  // row key for the start of the bucket. Buckets span a range
>> >>          // of up to 10000 records
>> >>     1 : "{ /* some JSON */ }",  // Column for record with id=1
>> >>     3 : "{ /* some more JSON */ }",  // Column for record with id=3
>> >>     ...
>> >>     9999 : "{ /* ... */ }"
>> >>   },
>> >>   10000 : {  // row key for the start of the next bucket
>> >>     10001 : ...
>> >>     10004 : ...
>> >>   }
>> >> }
>> >>
>> >> I am reading the data out of a local, sorted file on the client, so I
>> >> only write a row to Cassandra once all records for that row have been
>> >> read, and each row is written to exactly once.
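The bucketing described above amounts to a simple id-to-row-key mapping; a Python paraphrase (the function name is mine, the thread only describes the scheme):

```python
# Records are grouped into rows ("buckets") of up to 10,000 records;
# the row key is the first id of the bucket, as a string.

BUCKET_SIZE = 10_000

def bucket_row_key(record_id: int) -> str:
    """Row key of the bucket holding record_id.
    Ids 0-9999 -> "0", 10000-19999 -> "10000", and so on."""
    return str((record_id // BUCKET_SIZE) * BUCKET_SIZE)

print(bucket_row_key(3))      # "0"
print(bucket_row_key(10001))  # "10000"
```

Each record then becomes one column in its bucket row, with the record id as the column name and the JSON blob as the value.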
I'm using a
>> >> producer-consumer queue to pump data from the input reader thread to
>> >> the output writer threads. I found that I have to throttle the reader
>> >> thread heavily in order to get good behavior. So, if I make the reader
>> >> sleep for 7 seconds every 1M records, everything is fine - the data
>> >> loads in about an hour, half of which is spent by the reader thread
>> >> sleeping. In between the sleeps, I see ~40-50 MB/s throughput on the
>> >> client's network interface, and it takes ~7-8 seconds to write each
>> >> batch of 1M records.
>> >>
>> >> Now, if I remove the 7-second sleeps on the client side, things get
>> >> bad after the first ~8M records are written to the cluster. Write
>> >> throughput drops to <5 MB/s. I start seeing messages about nodes
>> >> disconnecting and reconnecting in Cassandra's system.log, as well as
>> >> lots of GC messages:
>> >>
>> >> ...
>> >>  INFO [Timer-1] 2010-04-06 04:03:27,178 Gossiper.java (line 179)
>> >> InetAddress /10.15.38.88 is now dead.
>> >>  INFO [GC inspection] 2010-04-06 04:03:30,259 GCInspector.java (line
>> >> 110) GC for ConcurrentMarkSweep: 2989 ms, 55326320 reclaimed leaving
>> >> 1035998648 used; max is 1211170816
>> >>  INFO [GC inspection] 2010-04-06 04:03:41,838 GCInspector.java (line
>> >> 110) GC for ConcurrentMarkSweep: 3004 ms, 24377240 reclaimed leaving
>> >> 1066120952 used; max is 1211170816
>> >>  INFO [Timer-1] 2010-04-06 04:03:44,136 Gossiper.java (line 179)
>> >> InetAddress /10.15.38.55 is now dead.
>> >>  INFO [GMFD:1] 2010-04-06 04:03:44,138 Gossiper.java (line 568)
>> >> InetAddress /10.15.38.55 is now UP
>> >>  INFO [GC inspection] 2010-04-06 04:03:52,957 GCInspector.java (line
>> >> 110) GC for ConcurrentMarkSweep: 2319 ms, 4504888 reclaimed leaving
>> >> 1086023832 used; max is 1211170816
>> >>  INFO [Timer-1] 2010-04-06 04:04:19,508 Gossiper.java (line 179)
>> >> InetAddress /10.15.38.242 is now dead.
>> >>  INFO [Timer-1] 2010-04-06 04:05:03,039 Gossiper.java (line 179)
>> >> InetAddress /10.15.38.55 is now dead.
>> >>  INFO [GMFD:1] 2010-04-06 04:05:03,041 Gossiper.java (line 568)
>> >> InetAddress /10.15.38.55 is now UP
>> >>  INFO [GC inspection] 2010-04-06 04:05:08,539 GCInspector.java (line
>> >> 110) GC for ConcurrentMarkSweep: 2375 ms, 39534920 reclaimed leaving
>> >> 1051620856 used; max is 1211170816
>> >> ...
>> >>
>> >> Finally followed by this, and some/all nodes going down:
>> >>
>> >> ERROR [COMPACTION-POOL:1] 2010-04-06 04:05:05,475
>> >> DebuggableThreadPoolExecutor.java (line 94) Error in executor
>> >> futuretask
>> >> java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError:
>> >> Java heap space
>> >>     at java.util.concurrent.FutureTask$Sync.innerGet(Unknown Source)
>> >>     at java.util.concurrent.FutureTask.get(Unknown Source)
>> >>     at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.afterExecute(DebuggableThreadPoolExecutor.java:86)
>> >>     at org.apache.cassandra.db.CompactionManager$CompactionExecutor.afterExecute(CompactionManager.java:582)
>> >>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>> >>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>> >>     at java.lang.Thread.run(Unknown Source)
>> >> Caused by: java.lang.OutOfMemoryError: Java heap space
>> >>     at java.util.Arrays.copyOf(Unknown Source)
>> >>     at java.io.ByteArrayOutputStream.write(Unknown Source)
>> >>     at java.io.DataOutputStream.write(Unknown Source)
>> >>     at org.apache.cassandra.io.IteratingRow.echoData(IteratingRow.java:69)
>> >>     at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:138)
>> >>     at
org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:1)
>> >>     at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
>> >>     at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
>> >>     at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
>> >>     at org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
>> >>     at org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
>> >>     at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:299)
>> >>     at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:102)
>> >>     at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:1)
>> >>     at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>> >>     at java.util.concurrent.FutureTask.run(Unknown Source)
>> >>     ... 3 more
>> >>
>> >> At first I thought that with ConsistencyLevel.ZERO I must be doing
>> >> async writes, so Cassandra can't push back on the client threads (by
>> >> blocking them), and thus the server is getting overwhelmed. But I
>> >> would expect it to start dropping data rather than crash in that case
>> >> (after all, I did say ZERO, so I can't expect any reliability,
>> >> right?). However, I see similar slowdown / node-dropout behavior when
>> >> I set the consistency level to ONE. Does Cassandra push back on
>> >> writers under heavy load? Is there some magic setting I need to tune
>> >> to keep it from falling over? Do I just need a bigger cluster? Thanks
>> >> in advance,
>> >>
>> >> -- Ilya
>> >>
>> >> P.S.
I realize that it's still handling a LOT of data for just 4
>> >> nodes, and in practice nobody would run a system that gets 125k
>> >> writes per second on top of a 4-node cluster. I was just surprised
>> >> that I could make Cassandra fall over at all using a single client
>> >> pumping data at 40-50 MB/s.
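The 125k writes/s figure is consistent with the batch timings quoted earlier in the thread; a quick sanity check (my own arithmetic):

```python
# Sanity check of the quoted rates: 1M-record batches take ~7-8s to
# write, plus a 7s sleep per batch, for ~250M records total. This
# should reproduce "about an hour, half of it sleeping" and the
# ~125k writes/s burst rate.

records_per_batch = 1_000_000
seconds_per_batch = 8          # upper end of the observed 7-8s range
sleep_per_batch = 7
total_records = 250_000_000

burst_rate = records_per_batch / seconds_per_batch
writing = total_records / burst_rate
sleeping = (total_records // records_per_batch) * sleep_per_batch

print(f"{burst_rate:,.0f} records/s during bursts")  # 125,000
print(f"~{(writing + sleeping) / 60:.1f} min total, "
      f"{sleeping / (writing + sleeping):.0%} of it sleeping")
```

Note that with 3x replication, the cluster is internally absorbing roughly three times that record rate, i.e. on the order of 375k column writes per second across 4 nodes.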