Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
From: aaron morton <aaron@thelastpickle.com>
Mime-Version: 1.0 (Apple Message framework v1082.1)
Content-Type: multipart/alternative; boundary=Apple-Mail-79--394713804
Subject: Re: Stress tests failed with secondary index
Date: Thu, 7 Apr 2011 23:21:56 +1000
In-Reply-To: <BANLkTi=br_Q3HhuKEVoAW87j6HtB-4Gr=g@mail.gmail.com>
To: user@cassandra.apache.org
References: <BANLkTimMxoHKwawRbb3O6RH8KpsNJPctTw@mail.gmail.com>
 <AC6A7534-B7B3-4426-9EBD-70C81A8A7295@thelastpickle.com>
 <BANLkTi=br_Q3HhuKEVoAW87j6HtB-4Gr=g@mail.gmail.com>
Message-Id: <2C459DCC-79F7-49F4-B699-1636B95218E4@thelastpickle.com>


--Apple-Mail-79--394713804
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=us-ascii

Can you turn the logging up to DEBUG level and look for a message from CassandraServer that says "... timed out"?
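
If the DEBUG log gets large, a small filter can pull out the relevant lines. This is only a sketch: it assumes each log entry is a single line containing both the class name CassandraServer and the text "timed out", and the sample lines below are made up for illustration, not real Cassandra output.

```python
import re

# Assumption: the class name and the message text appear on the same log line.
TIMEOUT_PATTERN = re.compile(r"CassandraServer.*timed out")


def find_timeout_lines(lines):
    """Return the lines that mention CassandraServer followed by 'timed out'."""
    return [line.rstrip("\n") for line in lines if TIMEOUT_PATTERN.search(line)]


# Hypothetical sample lines, invented purely to exercise the filter.
sample_log = [
    "example line 1: CassandraServer handled a request\n",
    "example line 2: CassandraServer ... timed out\n",
    "example line 3: some other class logged something\n",
]

matches = find_timeout_lines(sample_log)
print(matches)
```

To use it on a real log, pass an open file object instead of sample_log, for example find_timeout_lines(open(path)) where path is wherever your node writes its system log.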

Also check the thread pool stats ("nodetool tpstats") to see if the node is keeping up.

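A rough way to read the tpstats output is to look for pools whose pending count is not zero. The sketch below assumes the output is a table whose last three columns are active, pending and completed counts; the pool names and numbers in the sample are hypothetical, so check them against what your version actually prints.

```python
def pools_with_backlog(tpstats_text):
    """Map pool name -> pending count for every pool with pending > 0.

    Assumption: each data row ends with three integers
    (active, pending, completed) and the pool name comes before them.
    """
    backlog = {}
    for line in tpstats_text.splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue  # blank or malformed line
        try:
            active, pending, completed = (int(p) for p in parts[-3:])
        except ValueError:
            continue  # header row or non-numeric line
        if pending > 0:
            backlog[" ".join(parts[:-3])] = pending
    return backlog


# Hypothetical sample, invented for illustration only.
sample = """Pool Name        Active   Pending   Completed
ExamplePoolA          0         0        1200
ExamplePoolB          1        57         900
"""

backlog = pools_with_backlog(sample)
print(backlog)
```

A pending count that keeps growing between runs would suggest the node is not keeping up; a count that stays at zero points away from overload.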
Aaron

On 7 Apr 2011, at 13:43, Sheng Chen wrote:

> Thank you Aaron.
>
> It does not seem to be an overload problem.
>
> I have 16 cores and 48G of RAM on the single node, and I reduced the number of concurrent threads to 1.
> Still, it suddenly dies of a timeout while CPU, RAM, and disk load are all below 10% and write latency has been about 0.5ms for the past 10 minutes, which is really fast.
>
> No log entries about dropped messages were found.
>
> 2011/4/7 aaron morton <aaron@thelastpickle.com>
> TimedOutException means that fewer than the CL (consistency level) number of nodes responded to the coordinator before the rpc_timeout.
>
> So it was overloaded, which makes sense when you say it only happens with secondary indexes. Consider things like
> - reducing the throughput
> - reducing the number of clients
> - ensuring the clients are connecting to all nodes in the cluster.
>
> You will probably find some logs about dropped messages on some nodes.
> Aaron
>
> On 6 Apr 2011, at 20:39, Sheng Chen wrote:
>
> > I used the py_stress module to insert 10m test data with a secondary index.
> > I got the following exceptions.
> >
> > # python stress.py -d xxx -o insert -n 10000000 -c 5 -s 34 -C 5 -x keys
> > total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
> > 265322,26532,26541,0.00186140829433,10
> > 630300,36497,36502,0.00129331431204,20
> > 986781,35648,35640,0.0013310986218,30
> > 1332190,34540,34534,0.00135942295893,40
> > 1473578,14138,14138,0.00142941070007,50
> > Process Inserter-38:
> > Traceback (most recent call last):
> >   File "/usr/lib64/python2.4/site-packages/multiprocessing/process.py", line 237, in _bootstrap
> >     self.run()
> >   File "stress.py", line 242, in run
> >     self.cclient.batch_mutate(cfmap, consistency)
> >   File "/root/apache-cassandra-0.7.4-src/interface/thrift/gen-py/cassandra/Cassandra.py", line 784, in batch_mutate
> > TimedOutException: TimedOutException(args=())
> >     self.run()
> >   File "stress.py", line 242, in run
> >     self.recv_batch_mutate()
> >   File "/root/apache-cassandra-0.7.4-src/interface/thrift/gen-py/cassandra/Cassandra.py", line 810, in recv_batch_mutate
> >     raise result.te
> >
> >
> > Tests without a secondary index are OK at about 40k ops/sec.
> >
> > There is a `GC for ParNew` of about 200ms taking place every second. Does it matter?
> > The same GC, taking about 400ms every 2 seconds, does not hurt the inserts without a secondary index.
> >
> > Thanks in advance for any advice.
> >
> > Sheng
>
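
To make the TimedOutException condition quoted above concrete, here is a minimal sketch of the rule "fewer than CL nodes responded before the timeout". The function names are hypothetical and only three consistency levels are covered; it models the arithmetic, not the server's actual code path.

```python
def required_responses(consistency_level, replication_factor):
    """Number of replica responses the coordinator waits for."""
    if consistency_level == "ONE":
        return 1
    if consistency_level == "QUORUM":
        return replication_factor // 2 + 1
    if consistency_level == "ALL":
        return replication_factor
    raise ValueError("unsupported consistency level: %r" % consistency_level)


def would_time_out(responses_before_timeout, consistency_level, replication_factor):
    """True when fewer than the required replicas answered in time."""
    needed = required_responses(consistency_level, replication_factor)
    return responses_before_timeout < needed


# On a single node with replication factor 1, every level needs one response,
# so a timeout means the one replica did not answer within rpc_timeout.
print(would_time_out(0, "ONE", 1))
print(required_responses("QUORUM", 3))
```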

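The stress output in the original report can also be checked mechanically for the point where throughput falls off. The sketch below uses the five rows from the report as-is; the 50% threshold and the function name are arbitrary choices for illustration, and the last interval may simply be partial because the inserter process died during it.

```python
import csv
import io

# Rows copied from the stress.py output quoted in this thread.
STRESS_OUTPUT = """total,interval_op_rate,interval_key_rate,avg_latency,elapsed_time
265322,26532,26541,0.00186140829433,10
630300,36497,36502,0.00129331431204,20
986781,35648,35640,0.0013310986218,30
1332190,34540,34534,0.00135942295893,40
1473578,14138,14138,0.00142941070007,50
"""


def find_rate_drops(text, threshold=0.5):
    """Return (elapsed_time, rate) for intervals below threshold * peak rate so far."""
    reader = csv.DictReader(io.StringIO(text))
    peak = 0
    drops = []
    for row in reader:
        rate = int(row["interval_op_rate"])
        if peak and rate < threshold * peak:
            drops.append((int(row["elapsed_time"]), rate))
        peak = max(peak, rate)
    return drops


drops = find_rate_drops(STRESS_OUTPUT)
print(drops)
```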

--Apple-Mail-79--394713804--