Subject: Re: Occasional 10s Timeouts on Read
From: AJ Slater
Date: Sat, 19 Jun 2010 20:09:32 -0700
To: user@cassandra.apache.org

Agreed. But those connection errors were happening at essentially random
times, not at the times when I was seeing the problem. Now I am seeing the
problem, and here are some logs without ConnectionExceptions.

Here we're asking 10.33.2.70 for key 52e86817a577f75e545cdecd174d8b17,
which resides only on 10.33.3.10 and 10.33.3.20. Note all the apparently
normal communication, except that no mention of a request for key
52e86817a577f75e545cdecd174d8b17 ever comes up in 10.33.3.10's log, despite
10.33.2.70 saying it was reading from 10.33.3.10.
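For context, the read that hangs is just a single-key get_slice at
ConsistencyLevel.ONE. Roughly, the client side looks like the sketch below.
This is not the actual pinhole.py code, and the module names for the 0.6
Thrift-generated Python bindings are my assumption:

# Minimal sketch of the failing read: one get_slice at CL.ONE against
# 10.33.2.70, which then proxies the read to a replica (weakreadremote).
# Assumes the Thrift-generated 0.6 Python bindings are importable as the
# 'cassandra' package; not the real application code.
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from cassandra import Cassandra
from cassandra.ttypes import (ColumnParent, SlicePredicate, SliceRange,
                              ConsistencyLevel, TimedOutException)

sock = TSocket.TSocket('10.33.2.70', 9160)
# buffered (unframed) transport; swap in TFramedTransport if
# ThriftFramedTransport is enabled in storage-conf.xml
transport = TTransport.TBufferedTransport(sock)
client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))
transport.open()

parent = ColumnParent(column_family='Images')
predicate = SlicePredicate(
    slice_range=SliceRange(start='', finish='', reversed=False, count=100))

try:
    cols = client.get_slice('jolitics.com',
                            '52e86817a577f75e545cdecd174d8b17',
                            parent, predicate, ConsistencyLevel.ONE)
    print('%d columns returned' % len(cols))
except TimedOutException:
    print('timed out (rpc_timeout, 10s)')
finally:
    transport.close()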
The problem resolved itself again at 20:02, maybe 20 minutes later, where
all of a sudden I get my columns returned in milliseconds and I see good
stuff like:

DEBUG 20:06:35,238 Reading consistency digest for 52e86817a577f75e545cdecd174d8b17 from 59321@[/10.33.3.10, /10.33.3.20]

Here's some logs from the problem period.

10.33.2.70:/var/log/cassandra/output.log

DEBUG 19:42:03,230 get_slice
DEBUG 19:42:03,231 weakreadremote reading SliceFromReadCommand(table='jolitics.com', key='52e86817a577f75e545cdecd174d8b17', column_parent='QueryPath(columnFamilyName='Images', superColumnName='null', columnName='null')', start='', finish='', reversed=false, count=100)
DEBUG 19:42:03,231 weakreadremote reading SliceFromReadCommand(table='jolitics.com', key='52e86817a577f75e545cdecd174d8b17', column_parent='QueryPath(columnFamilyName='Images', superColumnName='null', columnName='null')', start='', finish='', reversed=false, count=100) from 57663@/10.33.3.10
TRACE 19:42:03,619 Gossip Digests are : /10.33.2.70:1276981671:20386 /10.33.3.10:1276983719:18303 /10.33.3.20:1276983726:18295 /10.33.2.70:1276981671:20386
TRACE 19:42:03,619 Sending a GossipDigestSynMessage to /10.33.3.20 ...
TRACE 19:42:03,619 Performing status check ...
TRACE 19:42:03,619 PHI for /10.33.3.10 : 0.95343619570936
TRACE 19:42:03,619 PHI for /10.33.3.20 : 0.8635116192106644
TRACE 19:42:03,621 Received a GossipDigestAckMessage from /10.33.3.20
TRACE 19:42:03,621 reporting /10.33.3.10
TRACE 19:42:03,621 reporting /10.33.3.20
TRACE 19:42:03,621 marking as alive /10.33.3.10
TRACE 19:42:03,621 Updating heartbeat state version to 18304 from 18303 for /10.33.3.10 ...
TRACE 19:42:03,621 marking as alive /10.33.3.20
TRACE 19:42:03,621 Updating heartbeat state version to 18296 from 18295 for /10.33.3.20 ...
TRACE 19:42:03,622 Scanning for state greater than 20385 for /10.33.2.70
TRACE 19:42:03,622 Scanning for state greater than 20385 for /10.33.2.70
TRACE 19:42:03,622 Sending a GossipDigestAck2Message to /10.33.3.20
TRACE 19:42:04,172 Received a GossipDigestSynMessage from /10.33.3.10
TRACE 19:42:04,172 reporting /10.33.3.10
TRACE 19:42:04,172 reporting /10.33.3.10
TRACE 19:42:04,172 Scanning for state greater than 20385 for /10.33.2.70
TRACE 19:42:04,172 @@@@ Size of GossipDigestAckMessage is 52
TRACE 19:42:04,172 Sending a GossipDigestAckMessage to /10.33.3.10
TRACE 19:42:04,174 Received a GossipDigestAck2Message from /10.33.3.10
TRACE 19:42:04,174 reporting /10.33.3.10
TRACE 19:42:04,174 marking as alive /10.33.3.10
TRACE 19:42:04,174 Updating heartbeat state version to 18305 from 18304 for /10.33.3.10 ...

10.33.3.10:/var/log/cassandra/output.log

TRACE 19:42:03,174 Sending a GossipDigestSynMessage to /10.33.3.20 ...
TRACE 19:42:03,174 Performing status check ...
TRACE 19:42:03,174 PHI for /10.33.2.70 : 1.3363463863632534
TRACE 19:42:03,174 PHI for /10.33.3.20 : 0.9297110501502452
TRACE 19:42:03,175 Received a GossipDigestAckMessage from /10.33.3.20
TRACE 19:42:03,176 reporting /10.33.2.70
TRACE 19:42:03,176 marking as alive /10.33.2.70
TRACE 19:42:03,176 Updating heartbeat state version to 20385 from 20384 for /10.33.2.70 ...
TRACE 19:42:03,176 Scanning for state greater than 18303 for /10.33.3.10
TRACE 19:42:03,176 Scanning for state greater than 18303 for /10.33.3.10
TRACE 19:42:03,176 Sending a GossipDigestAck2Message to /10.33.3.20
TRACE 19:42:03,230 Received a GossipDigestSynMessage from /10.33.3.20
TRACE 19:42:03,230 reporting /10.33.3.20
TRACE 19:42:03,231 reporting /10.33.3.20
TRACE 19:42:03,231 @@@@ Size of GossipDigestAckMessage is 35
TRACE 19:42:03,231 Sending a GossipDigestAckMessage to /10.33.3.20
TRACE 19:42:03,232 Received a GossipDigestAck2Message from /10.33.3.20
TRACE 19:42:03,232 reporting /10.33.3.20
TRACE 19:42:03,232 marking as alive /10.33.3.20
TRACE 19:42:03,232 Updating heartbeat state version to 18296 from 18295 for /10.33.3.20 ...
TRACE 19:42:04,173 Gossip Digests are : /10.33.3.10:1276983719:18305 /10.33.2.70

10.33.3.20:/var/log/cassandra/output.log

TRACE 19:42:03,174 Received a GossipDigestSynMessage from /10.33.3.10
TRACE 19:42:03,174 reporting /10.33.3.10
TRACE 19:42:03,174 reporting /10.33.3.10
TRACE 19:42:03,174 Scanning for state greater than 20384 for /10.33.2.70
TRACE 19:42:03,175 @@@@ Size of GossipDigestAckMessage is 52
TRACE 19:42:03,175 Sending a GossipDigestAckMessage to /10.33.3.10
TRACE 19:42:03,176 Received a GossipDigestAck2Message from /10.33.3.10
TRACE 19:42:03,176 reporting /10.33.3.10
TRACE 19:42:03,177 marking as alive /10.33.3.10
TRACE 19:42:03,177 Updating heartbeat state version to 18304 from 18303 for /10.33.3.10 ...
TRACE 19:42:03,229 Gossip Digests are : /10.33.3.20:1276983726:18296 /10.33.3.10:1276983719:18304 /10.33.3.20:1276983726:18296 /10.33.2.70:1276981671:20385
TRACE 19:42:03,229 Sending a GossipDigestSynMessage to /10.33.3.10 ...
TRACE 19:42:03,229 Performing status check ...
TRACE 19:42:03,229 PHI for /10.33.2.70 : 0.5938079948279411
TRACE 19:42:03,229 PHI for /10.33.3.10 : 0.045531699282787594
TRACE 19:42:03,231 Received a GossipDigestAckMessage from /10.33.3.10
TRACE 19:42:03,231 Scanning for state greater than 18295 for /10.33.3.20
TRACE 19:42:03,231 Scanning for state greater than 18295 for /10.33.3.20
TRACE 19:42:03,232 Sending a GossipDigestAck2Message to /10.33.3.10
TRACE 19:42:03,622 Received a GossipDigestSynMessage from /10.33.2.70
TRACE 19:42:03,622 reporting /10.33.2.70
TRACE 19:42:03,622 reporting /10.33.2.70
TRACE 19:42:03,622 Scanning for state greater than 18295 for /10.33.3.20
TRACE 19:42:03,623 Scanning for state greater than 18303 for /10.33.3.10
TRACE 19:42:03,623 @@@@ Size of GossipDigestAckMessage is 69
TRACE 19:42:03,623 Sending a GossipDigestAckMessage to /10.33.2.70
TRACE 19:42:03,625 Received a GossipDigestAck2Message from /10.33.2.70
TRACE 19:42:03,625 reporting /10.33.2.70
TRACE 19:42:03,625 marking as alive /10.33.2.70
TRACE 19:42:03,625 Updating heartbeat state version to 20386 from 20385 for /10.33.2.70 ...
TRACE 19:42:04,229 Gossip Digests are : /10.33.3.20:1276983726:18297 /10.33.2.70:1276981671:20386 /10.33.3.10:1276983719:18304 /10.33.3.20:1276983726:18297
TRACE 19:42:04,229 Sending a GossipDigestSynMessage to /10.33.3.10 ...
TRACE 19:42:04,229 Performing status check ...

AJ

On Sat, Jun 19, 2010 at 7:02 PM, Jonathan Ellis wrote:
> This is definitely not a Cassandra bug, something external is causing
> those connection failures.
>
> On Sat, Jun 19, 2010 at 3:12 PM, AJ Slater wrote:
>> Logging with TRACE reveals immediate problems with no client requests
>> coming in to the servers.
>> The problem was immediate and persisted over the course of half an hour:
>>
>> 10.33.2.70   lpc03
>> 10.33.3.10   fs01
>> 10.33.3.20   fs02
>>
>> aj@lpc03:~$ grep unable /var/log/cassandra/output.log
>> TRACE 14:07:52,104 unable to connect to /10.33.3.10
>> ...
>> TRACE 14:42:00,008 unable to connect to /10.33.3.20
>> ...
>> TRACE 14:42:06,751 unable to connect to /10.33.3.20
>>
>> Note that lpc03 has trouble talking to fs01 and fs02. But after seeing
>> this I started logging TRACE on fs01 and fs02.
>>
>> During the six seconds before I restarted fs02:
>>
>> aj@fs01:~/logs$ grep unable /var/log/cassandra/output.log | grep unable
>> Bad configuration; unable to start server
>> TRACE 14:42:00,865 unable to connect to /10.33.3.20
>> ...
>> TRACE 14:42:06,730 unable to connect to /10.33.3.20
>>
>> Restarted fs02 and no issues in any of the logs.
>>
>> aj@fs02:~$ grep unable /var/log/cassandra/output.log
>> aj@fs02:~$
>>
>> The unfiltered log messages all look more like:
>>
>> TRACE 14:42:06,248 unable to connect to /10.33.3.20
>> java.net.ConnectException: Connection refused
>>         at java.net.PlainSocketImpl.socketConnect(Native Method)
>>         at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
>>         at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
>>         at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
>>         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
>>         at java.net.Socket.connect(Socket.java:529)
>>         at java.net.Socket.connect(Socket.java:478)
>>         at java.net.Socket.<init>(Socket.java:375)
>>         at java.net.Socket.<init>(Socket.java:276)
>>         at org.apache.cassandra.net.OutboundTcpConnection.connect(OutboundTcpConnection.java:149)
>>         at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:85)
>>
>> On Sat, Jun 19, 2010 at 2:19 PM, AJ Slater wrote:
>>> I shall do just that. I did a bunch of tests this morning and the
>>> situation appears to be this:
>>>
>>> I have three nodes A, B and C, with RF=2. I understand now why this
>>> issue wasn't apparent with RF=3.
>>>
>>> If there are regular internode column requests going on (e.g. I set up
>>> a pinger to get remote columns), the cluster functions normally.
>>> However, if no internode column requests happen for a few hours (3
>>> hours is the minimum I've seen, but sometimes it takes longer), things
>>> go wrong. Using node A as the point of contact from the client, all
>>> columns that live on A are returned in a timely fashion, but for
>>> columns that only live on B & C, the retrieval times out, with this in
>>> the log:
>>>
>>> INFO 13:13:28,345 error writing to /10.33.3.20
>>>
>>> No requests for replicas or consistency checks are seen in the logs of
>>> B & C at this time. Using 'nodetool ring' from each of the three nodes
>>> shows all nodes as Up. Telnet from A to B on port 7000 connects. At
>>> first glance the tcpdump logs suggest that gossip communication,
>>> perhaps heartbeats, is proceeding normally, but I haven't really
>>> analyzed them.
>>>
>>> Fifteen minutes later, the cluster decided to behave normally again.
>>> Everyone talks to each other like buddies and delivers columns fast
>>> and regularly.
>>>
>>> This is really looking like a Cassandra bug. I'll report back with my
>>> TRACE log later and I expect I'll be opening a ticket.
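(Re: the "pinger" mentioned above -- it's nothing fancy. A rough sketch of
the idea, using the same assumed Thrift bindings as in the earlier sketch;
the key, column family and 60-second interval are just placeholders:)

# Keep-warm pinger sketch: once a minute, read one column that lives
# only on the remote replicas (fs01/fs02) so the inter-node read path
# never sits idle. Key, column family and interval are placeholders.
import time
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from cassandra import Cassandra
from cassandra.ttypes import (ColumnParent, SlicePredicate, SliceRange,
                              ConsistencyLevel, TimedOutException)

transport = TTransport.TBufferedTransport(TSocket.TSocket('10.33.2.70', 9160))
client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))
transport.open()

parent = ColumnParent(column_family='Images')
pred = SlicePredicate(
    slice_range=SliceRange(start='', finish='', reversed=False, count=1))

while True:
    start = time.time()
    try:
        client.get_slice('jolitics.com', '52e86817a577f75e545cdecd174d8b17',
                         parent, pred, ConsistencyLevel.ONE)
        print('ping ok in %.1f ms' % ((time.time() - start) * 1000))
    except TimedOutException:
        print('ping TIMED OUT after %.1f ms' % ((time.time() - start) * 1000))
    time.sleep(60)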
>>> The confidence level of my employer in my Cassandra solution to their
>>> petabyte data storage project is... uh... well... it could be better.
>>>
>>> AJ
>>>
>>> On Fri, Jun 18, 2010 at 8:16 PM, Jonathan Ellis wrote:
>>>> set log level to TRACE and see if the OutboundTcpConnection is going
>>>> bad. That would explain the message never arriving.
>>>>
>>>> On Fri, Jun 18, 2010 at 10:39 AM, AJ Slater wrote:
>>>>> To summarize:
>>>>>
>>>>> If a request for a column comes in *after a period of several hours
>>>>> with no requests*, then the node servicing the request hangs while
>>>>> looking for its peer rather than servicing the request like it should.
>>>>> It then throws either a TimedOutException or a (wrong)
>>>>> NotFoundException.
>>>>>
>>>>> And it doesn't appear to actually send the message it says it does to
>>>>> its peer. Or at least its peer doesn't report the request being
>>>>> received.
>>>>>
>>>>> And then the situation magically clears up after approximately 2 minutes.
>>>>>
>>>>> However, if the idle period never occurs, then the problem does not
>>>>> manifest. If I run a cron job with wget against my server every
>>>>> minute, I do not see the problem.
>>>>>
>>>>> I'll be looking at some tcpdump logs to see if I can suss out what's
>>>>> really happening, and perhaps file this as a bug. The several hours
>>>>> between reproducible events makes this whole thing aggravating for
>>>>> detection, debugging and, I'll assume, fixing, if it is indeed a
>>>>> Cassandra problem.
>>>>>
>>>>> It was suggested on IRC that it may be my network. But gossip is
>>>>> continually sending heartbeats, and nodetool and the logs show the
>>>>> nodes as up and available. If my network was flaking out I'd think it
>>>>> would be dropping heartbeats and I'd see that.
>>>>>
>>>>> AJ
>>>>>
>>>>> On Thu, Jun 17, 2010 at 2:26 PM, AJ Slater wrote:
>>>>>> These are physical machines.
>>>>>>
>>>>>> storage-conf.xml.fs03 is here:
>>>>>>
>>>>>> http://pastebin.com/weL41NB1
>>>>>>
>>>>>> Diffs from that for the other two storage-confs are inline here:
>>>>>>
>>>>>> aj@worm:../Z3/cassandra/conf/dev$ diff storage-conf.xml.lpc03 storage-conf.xml.fs01
>>>>>> 185c185
>>>>>> <   0
>>>>>> ---
>>>>>>>   71603818521973537678586548668074777838
>>>>>> 229c229
>>>>>> <   10.33.2.70
>>>>>> ---
>>>>>>>   10.33.3.10
>>>>>> 241c241
>>>>>> <   10.33.2.70
>>>>>> ---
>>>>>>>   10.33.3.10
>>>>>> 341c341
>>>>>> <   16
>>>>>> ---
>>>>>>>   4
>>>>>>
>>>>>> aj@worm:../Z3/cassandra/conf/dev$ diff storage-conf.xml.lpc03 storage-conf.xml.fs02
>>>>>> 185c185
>>>>>> <   0
>>>>>> ---
>>>>>>>   120215585224964746744782921158327379306
>>>>>> 206d205
>>>>>> <       10.33.3.20
>>>>>> 229c228
>>>>>> <   10.33.2.70
>>>>>> ---
>>>>>>>   10.33.3.20
>>>>>> 241c240
>>>>>> <   10.33.2.70
>>>>>> ---
>>>>>>>   10.33.3.20
>>>>>> 341c340
>>>>>> <   16
>>>>>> ---
>>>>>>>   4
>>>>>>
>>>>>> Thank you for your attention,
>>>>>>
>>>>>> AJ
>>>>>>
>>>>>> On Thu, Jun 17, 2010 at 2:09 PM, Benjamin Black wrote:
>>>>>>> Are these physical machines or virtuals? Did you post your
>>>>>>> cassandra.in.sh and storage-conf.xml someplace?
>>>>>>>
>>>>>>> On Thu, Jun 17, 2010 at 10:31 AM, AJ Slater wrote:
>>>>>>>> Total data size in the entire cluster is about twenty 12k images. With
>>>>>>>> no other load on the system. I just ask for one column and I get these
>>>>>>>> timeouts.
>>>>>>>> Performing multiple gets on the columns leads to multiple timeouts
>>>>>>>> for a period of a few seconds or minutes, and then the situation
>>>>>>>> magically resolves itself and response times are down to
>>>>>>>> single-digit milliseconds for a column get.
>>>>>>>>
>>>>>>>> On Thu, Jun 17, 2010 at 10:24 AM, AJ Slater wrote:
>>>>>>>>> Cassandra 0.6.2 from the apache debian source.
>>>>>>>>> Ubuntu Jaunty. Sun Java 6 JVM.
>>>>>>>>>
>>>>>>>>> All nodes in separate racks at 365 Main.
>>>>>>>>>
>>>>>>>>> On Thu, Jun 17, 2010 at 10:12 AM, AJ Slater wrote:
>>>>>>>>>> I'm seeing 10s timeouts on reads a few times a day. It's hard to
>>>>>>>>>> reproduce consistently but it seems to happen most often after it's
>>>>>>>>>> been a long time between reads. After presenting itself for a couple
>>>>>>>>>> of minutes the problem then goes away.
>>>>>>>>>>
>>>>>>>>>> I've got a three-node cluster with replication factor 2, reading at
>>>>>>>>>> consistency level ONE. The columns being read are around 12k each. The
>>>>>>>>>> nodes are 8GB multicore boxes with the JVM limits between 4GB and 6GB.
>>>>>>>>>>
>>>>>>>>>> Here's an application log from early this morning when a developer in
>>>>>>>>>> Belgrade accessed the system:
>>>>>>>>>>
>>>>>>>>>> Jun 17 03:54:17 lpc03 pinhole[5736]: MainThread:pinhole.py:61 | Requested image_id: 5827067133c3d670071c17d9144f0b49
>>>>>>>>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:pinhole.py:76 | TimedOutException for Image 5827067133c3d670071c17d9144f0b49
>>>>>>>>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:zlog.py:105 | Image Get took 10005.388975 ms
>>>>>>>>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:pinhole.py:61 | Requested image_id: af8caf3b76ce97d13812ddf795104a5c
>>>>>>>>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:zlog.py:105 | Image Get took 3.658056 ms
>>>>>>>>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:zlog.py:105 | Image Transform took 0.978947 ms
>>>>>>>>>>
>>>>>>>>>> That's a timeout and then a successful get of another column.
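(The "Image Get took N ms" lines above come from a timing wrapper in the
client. zlog.py isn't shown in this thread, so the sketch below is only an
illustration of the idea, not the real code:)

# Hypothetical timing wrapper behind the "Image Get took N ms" log lines.
# Logs wall-clock time around a call whether it succeeds or raises.
import time
import logging

log = logging.getLogger('pinhole')

def timed(label):
    def decorator(fn):
        def wrapper(*args, **kwargs):
            start = time.time()
            try:
                return fn(*args, **kwargs)
            finally:
                log.info('%s took %f ms', label, (time.time() - start) * 1000)
        return wrapper
    return decorator

@timed('Image Get')
def get_image(client, image_id):
    # client.get_slice(...) as in the earlier sketch
    pass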
>>>>>>>>>> Here's the cassandra log for 10.33.2.70:
>>>>>>>>>>
>>>>>>>>>> DEBUG 03:54:17,070 get_slice
>>>>>>>>>> DEBUG 03:54:17,071 weakreadremote reading SliceFromReadCommand(table='jolitics.com', key='5827067133c3d670071c17d9144f0b49', column_parent='QueryPath(columnFamilyName='Images', superColumnName='null', columnName='null')', start='', finish='', reversed=false, count=100)
>>>>>>>>>> DEBUG 03:54:17,071 weakreadremote reading SliceFromReadCommand(table='jolitics.com', key='5827067133c3d670071c17d9144f0b49', column_parent='QueryPath(columnFamilyName='Images', superColumnName='null', columnName='null')', start='', finish='', reversed=false, count=100) from 45138@/10.33.3.10
>>>>>>>>>> DEBUG 03:54:27,077 get_slice
>>>>>>>>>> DEBUG 03:54:27,078 weakreadlocal reading SliceFromReadCommand(table='jolitics.com', key='af8caf3b76ce97d13812ddf795104a5c', column_parent='QueryPath(columnFamilyName='Images', superColumnName='null', columnName='null')', start='', finish='', reversed=false, count=100)
>>>>>>>>>> DEBUG 03:54:27,079 collecting body:false:1610@1275951327610885
>>>>>>>>>> DEBUG 03:54:27,080 collecting body:false:1610@1275951327610885
>>>>>>>>>> DEBUG 03:54:27,080 Reading consistency digest for af8caf3b76ce97d13812ddf795104a5c from 45168@[/10.33.2.70, /10.33.3.10]
>>>>>>>>>> DEBUG 03:54:50,779 Disseminating load info ...
>>>>>>>>>>
>>>>>>>>>> It looks like it asks for key='5827067133c3d670071c17d9144f0b49' from
>>>>>>>>>> the local host and also queries 10.33.3.10 for the first one, and then
>>>>>>>>>> for 'af8caf3b76ce97d13812ddf795104a5c' it only queries the local host
>>>>>>>>>> and then returns appropriately.
>>>>>>>>>>
>>>>>>>>>> Here's the log for 10.33.3.10 around that time:
>>>>>>>>>>
>>>>>>>>>> DEBUG 03:54:19,645 Disseminating load info ...
>>>>>>>>>> DEBUG 03:55:19,645 Disseminating load info ...
>>>>>>>>>> DEBUG 03:56:19,646 Disseminating load info ...
>>>>>>>>>> DEBUG 03:57:19,645 Disseminating load info ...
>>>>>>>>>> DEBUG 03:58:19,645 Disseminating load info ...
>>>>>>>>>> DEBUG 03:59:19,646 Disseminating load info ...
>>>>>>>>>> DEBUG 04:00:18,635 GC for ParNew: 4 ms, 21443128 reclaimed leaving 55875144 used; max is 6580535296
>>>>>>>>>>
>>>>>>>>>> No record of communication from 10.33.2.70.
>>>>>>>>>>
>>>>>>>>>> Does this ring any bells for anyone? I can of course attach
>>>>>>>>>> storage-confs for all nodes if that sounds useful and I'll be on
>>>>>>>>>> #cassandra as ajslater.
>>>>>>>>>>
>>>>>>>>>> Much thanks for taking a look and any suggestions. We fear we'll have
>>>>>>>>>> to abandon Cassandra if this bug cannot be resolved.
>>>>>>>>>>
>>>>>>>>>> AJ
>>>>>>>>>>
>>>>>
>>>>
>>>> --
>>>> Jonathan Ellis
>>>> Project Chair, Apache Cassandra
>>>> co-founder of Riptano, the source for professional Cassandra support
>>>> http://riptano.com
>>>>
>>>
>>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>