Return-Path: Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: (qmail 32708 invoked from network); 17 Jun 2010 21:27:09 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 17 Jun 2010 21:27:09 -0000 Received: (qmail 89445 invoked by uid 500); 17 Jun 2010 21:27:08 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 89359 invoked by uid 500); 17 Jun 2010 21:27:08 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 89351 invoked by uid 99); 17 Jun 2010 21:27:08 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 17 Jun 2010 21:27:08 +0000 X-ASF-Spam-Status: No, hits=-0.0 required=10.0 tests=SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of aj@birthdayalarm.com designates 209.85.214.172 as permitted sender) Received: from [209.85.214.172] (HELO mail-iw0-f172.google.com) (209.85.214.172) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 17 Jun 2010 21:27:02 +0000 Received: by iwn2 with SMTP id 2so308155iwn.31 for ; Thu, 17 Jun 2010 14:26:41 -0700 (PDT) MIME-Version: 1.0 Received: by 10.42.5.207 with SMTP id 15mr41853icx.59.1276810000214; Thu, 17 Jun 2010 14:26:40 -0700 (PDT) Sender: aj@birthdayalarm.com Received: by 10.231.148.15 with HTTP; Thu, 17 Jun 2010 14:26:40 -0700 (PDT) In-Reply-To: References: Date: Thu, 17 Jun 2010 14:26:40 -0700 X-Google-Sender-Auth: heljbh7qVOfAdxEA9o8TCyyaDr0 Message-ID: Subject: Re: Occasional 10s Timeouts on Read From: AJ Slater To: user@cassandra.apache.org Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Virus-Checked: Checked by ClamAV on apache.org These are physical machines. storage-conf.xml.fs03 is here: http://pastebin.com/weL41NB1 Diffs from that for the other two storage-confs are inline here: aj@worm:../Z3/cassandra/conf/dev$ diff storage-conf.xml.lpc03 storage-conf.xml.fs01 185c185 > 71603818521973537678586548668074777838 229c229 < 10.33.2.70 --- > 10.33.3.10 241c241 < 10.33.2.70 --- > 10.33.3.10 341c341 < 16 --- > 4 aj@worm:../Z3/cassandra/conf/dev$ diff storage-conf.xml.lpc03 storage-conf.xml.fs02 185c185 < 0 --- > 120215585224964746744782921158327379306 206d205 < 10.33.3.20 229c228 < 10.33.2.70 --- > 10.33.3.20 241c240 < 10.33.2.70 --- > 10.33.3.20 341c340 < 16 --- > 4 Thank you for your attention, AJ On Thu, Jun 17, 2010 at 2:09 PM, Benjamin Black wrote: > Are these physical machines or virtuals? =A0Did you post your > cassandra.in.sh and storage-conf.xml someplace? > > On Thu, Jun 17, 2010 at 10:31 AM, AJ Slater wrote: >> Total data size in the entire cluster is about twenty 12k images. With >> no other load on the system. I just ask for one column and I get these >> timeouts. Performing multiple gets on the columns leads to multiple >> timeouts for a period of a few seconds or minutes and then the >> situation magically resolves itself and response times are down to >> single digit milliseconds for a column get. >> >> On Thu, Jun 17, 2010 at 10:24 AM, AJ Slater wrote: >>> Cassandra 0.6.2 from the apache debian source. >>> Ubunutu Jaunty. Sun Java6 jvm. >>> >>> All nodes in separate racks at 365 main. >>> >>> On Thu, Jun 17, 2010 at 10:12 AM, AJ Slater wrote: >>>> I'm seing 10s timeouts on reads few times a day. Its hard to reproduce >>>> consistently but seems to happen most often after its been a long time >>>> between reads. After presenting itself for a couple minutes the >>>> problem then goes away. >>>> >>>> I've got a three node cluster with replication factor 2, reading at >>>> consistency level ONE. The columns being read are around 12k each. The >>>> nodes are 8GB multicore boxes with the JVM limits between 4GB and 6GB. >>>> >>>> Here's an application log from early this morning when a developer in >>>> Belgrade accessed the system: >>>> >>>> Jun 17 03:54:17 lpc03 pinhole[5736]: MainThread:pinhole.py:61 | >>>> Requested image_id: 5827067133c3d670071c17d9144f0b49 >>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:pinhole.py:76 | >>>> TimedOutException for Image 5827067133c3d670071c17d9144f0b49 >>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:zlog.py:105 | Image >>>> Get took 10005.388975 ms >>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:pinhole.py:61 | >>>> Requested image_id: af8caf3b76ce97d13812ddf795104a5c >>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:zlog.py:105 | Image >>>> Get took 3.658056 ms >>>> Jun 17 03:54:27 lpc03 pinhole[5736]: MainThread:zlog.py:105 | Image >>>> Transform took 0.978947 ms >>>> >>>> That's a Timeout and then a successful get of another column. >>>> >>>> Here's the cassandra log for 10.33.2.70: >>>> >>>> DEBUG 03:54:17,070 get_slice >>>> DEBUG 03:54:17,071 weakreadremote reading >>>> SliceFromReadCommand(table=3D'jolitics.com', >>>> key=3D'5827067133c3d670071c17d9144f0b49', >>>> column_parent=3D'QueryPath(columnFamilyName=3D'Images', >>>> superColumnName=3D'null', columnName=3D'null')', start=3D'', finish=3D= ' >>>> ', reversed=3Dfalse, count=3D100) >>>> DEBUG 03:54:17,071 weakreadremote reading >>>> SliceFromReadCommand(table=3D'jolitics.com', >>>> key=3D'5827067133c3d670071c17d9144f0b49', >>>> column_parent=3D'QueryPath(columnFamilyName=3D'Images', >>>> superColumnName=3D'null', columnName=3D'null')', start=3D'', finish=3D= ' >>>> ', reversed=3Dfalse, count=3D100) from 45138@/10.33.3.10 >>>> DEBUG 03:54:27,077 get_slice >>>> DEBUG 03:54:27,078 weakreadlocal reading >>>> SliceFromReadCommand(table=3D'jolitics.com', >>>> key=3D'af8caf3b76ce97d13812ddf795104a5c', >>>> column_parent=3D'QueryPath(columnFamilyName=3D'Images', >>>> superColumnName=3D'null', columnName=3D'null')', start=3D'', finish=3D= '' >>>> , reversed=3Dfalse, count=3D100) >>>> DEBUG 03:54:27,079 collecting body:false:1610@1275951327610885 >>>> DEBUG 03:54:27,080 collecting body:false:1610@1275951327610885 >>>> DEBUG 03:54:27,080 Reading consistency digest for af8caf3b76ce97d13812= ddf795104a >>>> 5c from 45168@[/10.33.2.70, /10.33.3.10] >>>> DEBUG 03:54:50,779 Disseminating load info ... >>>> >>>> It looks like it asks for key=3D'5827067133c3d670071c17d9144f0b49' fro= m >>>> the local host and also queries 10.33.3.10 for the first one and then >>>> for 'af8caf3b76ce97d13812ddf795104a5c' it only queries the local host >>>> and then returns appropriately. >>>> >>>> Here's the log for 10.33.3.10 around that time: >>>> >>>> DEBUG 03:54:19,645 Disseminating load info ... >>>> DEBUG 03:55:19,645 Disseminating load info ... >>>> DEBUG 03:56:19,646 Disseminating load info ... >>>> DEBUG 03:57:19,645 Disseminating load info ... >>>> DEBUG 03:58:19,645 Disseminating load info ... >>>> DEBUG 03:59:19,646 Disseminating load info ... >>>> DEBUG 04:00:18,635 GC for ParNew: 4 ms, 21443128 reclaimed leaving >>>> 55875144 used; max is 6580535296 >>>> >>>> No record of communication from 10.33.2.70. >>>> >>>> Does this ring any bells for anyone? I can of course attach >>>> storage-conf's for all nodes if that sounds useful and I'll be on >>>> #cassandra as ajslater. >>>> >>>> Much thanks for taking a look and any suggestions. We fear we'll have >>>> to abandon Cassandra if this bug cannot be resolved. >>>> >>>> AJ >>>> >>> >> >