From: AJ Slater
To: user@cassandra.apache.org
Date: Sat, 19 Jun 2010 14:19:58 -0700
Subject: Re: Occasional 10s Timeouts on Read

I shall do just that.

I did a bunch of tests this morning and the situation appears to be this:

I have three nodes A, B and C, with RF=2. I understand now why this
issue wasn't apparent with RF=3. If there are regular inter-node column
requests going on (e.g. I set up a pinger to get remote columns), the
cluster functions normally. However, if no inter-node column requests
happen for a few hours (3 hours is the minimum I've seen, but sometimes
it takes longer), things go wrong.

Using node A as the point of contact from the client, all columns that
live on A are returned in a timely fashion, but for columns that only
live on B & C, the retrieval times out, with this in the log:

INFO 13:13:28,345 error writing to /10.33.3.20

No requests for replicas or consistency checks are seen in the logs of
B & C at this time. Using 'nodetool ring' from each of the three nodes
shows all nodes as Up. Telnet from A to B on port 7000 connects. At
first glance the tcpdump logs suggest that gossip communication,
perhaps heartbeats, is proceeding normally, but I haven't really
analyzed them.

Fifteen minutes later, the cluster decided to behave normally again.
Everyone talks to each other like buddies and delivers columns fast
and regularly.

This is really looking like a Cassandra bug. I'll report back with my
TRACE log later and I expect I'll be opening a ticket. The confidence
level of my employer in my Cassandra solution to their petabyte data
storage project is... uh... well... it could be better.

AJ
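P.S. For anyone following along, by "set log level to TRACE" I mean
something along these lines in log4j.properties (conf/ or
/etc/cassandra, depending on the install) on node A; the logger name
below is my best guess at the class Jonathan is referring to, so treat
it as a sketch rather than gospel:

  # leave the root logger at INFO and turn up only the inter-node
  # connection class, then restart the node
  log4j.logger.org.apache.cassandra.net.OutboundTcpConnection=TRACE

That should make it visible whether the read ever gets handed to the
connection for the peer, and whether that connection is being torn down.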
On Fri, Jun 18, 2010 at 8:16 PM, Jonathan Ellis wrote:
> set log level to TRACE and see if the OutboundTcpConnection is going
> bad. That would explain the message never arriving.
>
> On Fri, Jun 18, 2010 at 10:39 AM, AJ Slater wrote:
>> To summarize:
>>
>> If a request for a column comes in *after a period of several hours
>> with no requests*, then the node servicing the request hangs while
>> looking for its peer rather than servicing the request like it should.
>> It then throws either a TimedOutException or a (wrong)
>> NotFoundException.
>>
>> And it doesn't appear to actually send the message it says it does to
>> its peer. Or at least its peer doesn't report the request being
>> received.
>>
>> And then the situation magically clears up after approximately 2 minutes.
>>
>> However, if the idle period never occurs, then the problem does not
>> manifest. If I run a cron job with wget against my server every
>> minute, I do not see the problem.
>>
>> I'll be looking at some tcpdump logs to see if I can suss out what's
>> really happening, and perhaps file this as a bug. The several hours
>> between reproducible events makes this whole thing aggravating for
>> detection, debugging and, I'll assume, fixing, if it is indeed a
>> Cassandra problem.
>>
>> It was suggested on IRC that it may be my network. But gossip is
>> continually sending heartbeats, and nodetool and the logs show the
>> nodes as up and available. If my network were flaking out I'd think it
>> would be dropping heartbeats and I'd see that.
>>
>> AJ
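(The cron job mentioned above is literally a one-line crontab entry
along these lines; the URL is illustrative rather than the real one,
and it just hits the pinhole app, which in turn does a column get
against Cassandra:)

  # hit the app once a minute so reads never go idle (URL is illustrative)
  * * * * * wget -q -O /dev/null http://lpc03/image/5827067133c3d670071c17d9144f0b49

With that running, the idle window never opens and the timeouts never
show up; take it away and, a few hours later, they do.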
>>
>> On Thu, Jun 17, 2010 at 2:26 PM, AJ Slater wrote:
>>> These are physical machines.
>>>
>>> storage-conf.xml.fs03 is here:
>>>
>>> http://pastebin.com/weL41NB1
>>>
>>> Diffs from that for the other two storage-confs are inline here:
>>>
>>> aj@worm:../Z3/cassandra/conf/dev$ diff storage-conf.xml.lpc03 storage-conf.xml.fs01
>>> 185c185
>>> <   0
>>> ---
>>>>   71603818521973537678586548668074777838
>>> 229c229
>>> <   10.33.2.70
>>> ---
>>>>   10.33.3.10
>>> 241c241
>>> <   10.33.2.70
>>> ---
>>>>   10.33.3.10
>>> 341c341
>>> <   16
>>> ---
>>>>   4
>>>
>>> aj@worm:../Z3/cassandra/conf/dev$ diff storage-conf.xml.lpc03 storage-conf.xml.fs02
>>> 185c185
>>> <   0
>>> ---
>>>>   120215585224964746744782921158327379306
>>> 206d205
>>> <       10.33.3.20
>>> 229c228
>>> <   10.33.2.70
>>> ---
>>>>   10.33.3.20
>>> 241c240
>>> <   10.33.2.70
>>> ---
>>>>   10.33.3.20
>>> 341c340
>>> <   16
>>> ---
>>>>   4
>>>
>>> Thank you for your attention,
>>>
>>> AJ
>>>
>>> On Thu, Jun 17, 2010 at 2:09 PM, Benjamin Black wrote:
>>>> Are these physical machines or virtuals? Did you post your
>>>> cassandra.in.sh and storage-conf.xml someplace?
>>>>
>>>> On Thu, Jun 17, 2010 at 10:31 AM, AJ Slater wrote:
>>>>> Total data size in the entire cluster is about twenty 12k images,
>>>>> with no other load on the system. I just ask for one column and I
>>>>> get these timeouts. Performing multiple gets on the columns leads to
>>>>> multiple timeouts for a period of a few seconds or minutes, and then
>>>>> the situation magically resolves itself and response times are down
>>>>> to single-digit milliseconds for a column get.
>>>>>
>>>>> On Thu, Jun 17, 2010 at 10:24 AM, AJ Slater wrote:
>>>>>> Cassandra 0.6.2 from the apache debian source.
>>>>>> Ubuntu Jaunty. Sun Java6 JVM.
>>>>>>
>>>>>> All nodes in separate racks at 365 Main.
>>>>>>
>>>>>> On Thu, Jun 17, 2010 at 10:12 AM, AJ Slater wrote:
>>>>>>> I'm seeing 10s timeouts on reads a few times a day. It's hard to
>>>>>>> reproduce consistently but it seems to happen most often after it's
>>>>>>> been a long time between reads. After presenting itself for a couple
>>>>>>> of minutes, the problem then goes away.
>>>>>>>
>>>>>>> I've got a three node cluster with replication factor 2, reading at
>>>>>>> consistency level ONE. The columns being read are around 12k each. The
>>>>>>> nodes are 8GB multicore boxes with the JVM limits between 4GB and 6GB.
>>>>>>>
>>>>>>> Here's an application log from early this morning when a developer in
>>>>>>> Belgrade accessed the system:
>>>>>>>
>>>>>>> Jun 17 03:54:17 lpc03 pinhole[5736]: MainThread:pinhole.py:61 |
>>>>>>> Requested image_id: 5827067133c3d670071c17d9144f0b49
>>>>>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:pinhole.py:76 |
>>>>>>> TimedOutException for Image 5827067133c3d670071c17d9144f0b49
>>>>>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:zlog.py:105 | Image
>>>>>>> Get took 10005.388975 ms
>>>>>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:pinhole.py:61 |
>>>>>>> Requested image_id: af8caf3b76ce97d13812ddf795104a5c
>>>>>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:zlog.py:105 | Image
>>>>>>> Get took 3.658056 ms
>>>>>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:zlog.py:105 | Image
>>>>>>> Transform took 0.978947 ms
>>>>>>>
>>>>>>> That's a timeout and then a successful get of another column.
>>>>>>>
>>>>>>> Here's the Cassandra log for 10.33.2.70:
>>>>>>>
>>>>>>> DEBUG 03:54:17,070 get_slice
>>>>>>> DEBUG 03:54:17,071 weakreadremote reading
>>>>>>> SliceFromReadCommand(table='jolitics.com',
>>>>>>> key='5827067133c3d670071c17d9144f0b49',
>>>>>>> column_parent='QueryPath(columnFamilyName='Images',
>>>>>>> superColumnName='null', columnName='null')', start='', finish='',
>>>>>>> reversed=false, count=100)
>>>>>>> DEBUG 03:54:17,071 weakreadremote reading
>>>>>>> SliceFromReadCommand(table='jolitics.com',
>>>>>>> key='5827067133c3d670071c17d9144f0b49',
>>>>>>> column_parent='QueryPath(columnFamilyName='Images',
>>>>>>> superColumnName='null', columnName='null')', start='', finish='',
>>>>>>> reversed=false, count=100) from 45138@/10.33.3.10
>>>>>>> DEBUG 03:54:27,077 get_slice
>>>>>>> DEBUG 03:54:27,078 weakreadlocal reading
>>>>>>> SliceFromReadCommand(table='jolitics.com',
>>>>>>> key='af8caf3b76ce97d13812ddf795104a5c',
>>>>>>> column_parent='QueryPath(columnFamilyName='Images',
>>>>>>> superColumnName='null', columnName='null')', start='', finish='',
>>>>>>> reversed=false, count=100)
>>>>>>> DEBUG 03:54:27,079 collecting body:false:1610@1275951327610885
>>>>>>> DEBUG 03:54:27,080 collecting body:false:1610@1275951327610885
>>>>>>> DEBUG 03:54:27,080 Reading consistency digest for
>>>>>>> af8caf3b76ce97d13812ddf795104a5c from 45168@[/10.33.2.70, /10.33.3.10]
>>>>>>> DEBUG 03:54:50,779 Disseminating load info ...
>>>>>>>
>>>>>>> It looks like it asks for key='5827067133c3d670071c17d9144f0b49' from
>>>>>>> the local host and also queries 10.33.3.10 for the first one, and then
>>>>>>> for 'af8caf3b76ce97d13812ddf795104a5c' it only queries the local host
>>>>>>> and then returns appropriately.
>>>>>>>
>>>>>>> Here's the log for 10.33.3.10 around that time:
>>>>>>>
>>>>>>> DEBUG 03:54:19,645 Disseminating load info ...
>>>>>>> DEBUG 03:55:19,645 Disseminating load info ...
>>>>>>> DEBUG 03:56:19,646 Disseminating load info ...
>>>>>>> DEBUG 03:57:19,645 Disseminating load info ...
>>>>>>> DEBUG 03:58:19,645 Disseminating load info ...
>>>>>>> DEBUG 03:59:19,646 Disseminating load info ...
>>>>>>> DEBUG 04:00:18,635 GC for ParNew: 4 ms, 21443128 reclaimed leaving
>>>>>>> 55875144 used; max is 6580535296
>>>>>>>
>>>>>>> No record of communication from 10.33.2.70.
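(For the tcpdump check mentioned further up: something along these
lines on node A should be enough to tell whether the read ever leaves
the box at all; the interface name is a guess, and 7000 is the
StoragePort from storage-conf.xml:)

  # capture inter-node traffic from A to the peer while reproducing a timeout
  sudo tcpdump -i eth0 -n -w internode.pcap host 10.33.3.10 and port 7000

If the read message is really being written, packets to 10.33.3.10:7000
should show up at the moment of the get; if all that appears is the
periodic gossip chatter, that points back at the OutboundTcpConnection
going bad.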
>>>>>>>
>>>>>>> Does this ring any bells for anyone? I can of course attach
>>>>>>> storage-confs for all nodes if that sounds useful, and I'll be on
>>>>>>> #cassandra as ajslater.
>>>>>>>
>>>>>>> Much thanks for taking a look and any suggestions. We fear we'll have
>>>>>>> to abandon Cassandra if this bug cannot be resolved.
>>>>>>>
>>>>>>> AJ
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com