Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 40AB498D6 for ; Thu, 19 Apr 2012 20:13:44 +0000 (UTC) Received: (qmail 914 invoked by uid 500); 19 Apr 2012 20:13:41 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 897 invoked by uid 500); 19 Apr 2012 20:13:41 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 888 invoked by uid 99); 19 Apr 2012 20:13:41 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 19 Apr 2012 20:13:41 +0000 X-ASF-Spam-Status: No, hits=2.2 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_NONE,SPF_PASS,T_FRT_FOLLOW1 X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: local policy) Received: from [208.113.200.5] (HELO homiemail-a47.g.dreamhost.com) (208.113.200.5) by apache.org (qpsmtpd/0.29) with ESMTP; Thu, 19 Apr 2012 20:13:31 +0000 Received: from homiemail-a47.g.dreamhost.com (localhost [127.0.0.1]) by homiemail-a47.g.dreamhost.com (Postfix) with ESMTP id 26CD6284058 for ; Thu, 19 Apr 2012 13:13:08 -0700 (PDT) DomainKey-Signature: a=rsa-sha1; c=nofws; d=thelastpickle.com; h=from :mime-version:content-type:subject:date:in-reply-to:to :references:message-id; q=dns; s=thelastpickle.com; b=F6izun1Ciy et5yqlshe4J0BVxLSZDbnFezJSdGlnLw113oGxQ6X3aMjhLV+3vZzUHPjbPz0HnQ VF9Vw3nTDP328na4xX/QuhOiF48DENvepI0L39fjYKhtZ6laLJ2bFozC8ZAxjkGV gxRqGEe9F7FaudWBvXJhG4wTKuI8oUEMg= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=thelastpickle.com; h=from :mime-version:content-type:subject:date:in-reply-to:to :references:message-id; s=thelastpickle.com; bh=bPCGSphr8o31ehs0 SpvgU7VTzDA=; b=EdjypX+EtIRsYwLyv8rPAvhcGEsuPP9fLaGXLmpo4mysM6oe 2M+olGORsFLvVi5btSBQVlOsUQpWSVnEn4rLNtVXVILWlXXOl1+hoZjVw74Agbtb ggKW2uq8gJ0Xv/bUzjFYhIH+VMiKVA4j5YBMaYZFEiSFb2wIsNU/Zgy0nl8= Received: from 202-126-206-197.vectorcommunications.net.nz (unknown [202.126.206.197]) (using TLSv1 with cipher AES128-SHA (128/128 bits)) (No client certificate requested) (Authenticated sender: aaron@thelastpickle.com) by homiemail-a47.g.dreamhost.com (Postfix) with ESMTPSA id 2FC47284057 for ; Thu, 19 Apr 2012 13:13:07 -0700 (PDT) From: aaron morton Mime-Version: 1.0 (Apple Message framework v1257) Content-Type: multipart/alternative; boundary="Apple-Mail=_7EBFD030-891E-4E3C-B63C-AE52E6C728B0" Subject: Re: RMI/JMX errors, weird Date: Fri, 20 Apr 2012 08:13:03 +1200 In-Reply-To: <4F90051E.1090004@bnl.gov> To: user@cassandra.apache.org References: <4F876809.6000006@rightscale.com> <4F8769A4.3050702@bnl.gov> <950D362D-ABA2-42CF-B701-7D4AF2FC9112@thelastpickle.com> <4F8F77B3.70907@bnl.gov> <5B4DAE9F-8536-4B48-8CEC-EC59E9C4C796@thelastpickle.com> <4F90051E.1090004@bnl.gov> Message-Id: X-Mailer: Apple Mail (2.1257) X-Virus-Checked: Checked by ClamAV on apache.org --Apple-Mail=_7EBFD030-891E-4E3C-B63C-AE52E6C728B0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=iso-8859-1 > 1150 pending tasks, and is not > making progress.=20 Not all pending tasks reported by nodetool compactionstats actually run. = Once they get a chance to run the files they were going to work on may = have already been compacted.=20 Given that repair tests at double the phi threshold, it may not make = much difference. Did other nodes notice it was dead ? Was there anything in the log that = showed it was under duress (GC or dropped message logs) ?=20 Is the compaction a consequence of repair ? (The streaming stage can = result in compactions). Or do you think the node is just behind on = compactions ?=20 If you feel compaction is hurting the node, consider setting = concurrent_compactors in the yaml to 2. You can also isolate the node from updates using nodetool disablegossip = and disablerthrift , and the turn off the IO limiter with nodetool = setcompactionthroughput 0. =20 =20 Hope that helps.=20 =20 ----------------- Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 20/04/2012, at 12:29 AM, Maxim Potekhin wrote: > Hello Aaron, >=20 > how should I go about fixing that? Also, after a repeated attempt to = compact > it goes again into "building secondary index" with 1150 pending tasks, = and is not > making progress. I suspected the disk system failure, but this needs = to be confirmed. >=20 > So basically, do I need to tune the phi threshold up? Thing is, there = was no heavy load > on the cluster at all. >=20 > Thanks >=20 > Maxim >=20 >=20 >=20 >=20 > On 4/19/2012 7:06 AM, aaron morton wrote: >>=20 >> At some point the gossip system on the node this log is from decided = that 130.199.185.195 was DOWN. This was based on how often the node was = gossiping to the cluster.=20 >>=20 >> The active repair session was informed. And to avoid failing the job = unnecessarily it tested that the errant nodes phi value was twice the = configured phi_convict_threshold. It was and the repair was killed.=20 >>=20 >> Take a look at the logs on 130.199.185.195 and see if anything was = happening on the node at the same time. Could be GC or an overloaded = node (it would log about dropped messages).=20 >>=20 >> Perhaps other nodes also saw 130.199.185.195 as down? it only needed = to be down for a few seconds.=20 >>=20 >> Hope that helps.=20 >>=20 >> ----------------- >> Aaron Morton >> Freelance Developer >> @aaronmorton >> http://www.thelastpickle.com >>=20 >> On 19/04/2012, at 2:25 PM, Maxim Potekhin wrote: >>=20 >>> Server log below. Mind you that all the nodes are still up -- even = though reported as "dead" in this log. >>> What's going on here? >>>=20 >>> Thanks! >>>=20 >>> INFO [GossipTasks:1] 2012-04-18 22:18:26,487 Gossiper.java (line = 719) InetAddress /130.199.185.193 is now dead. >>> INFO [ScheduledTasks:1] 2012-04-18 22:18:26,487 StatusLogger.java = (line 50) Pool Name Active Pending Blocked >>> ERROR [GossipTasks:1] 2012-04-18 22:18:26,488 = AntiEntropyService.java (line 722) Problem during repair session = manual-repair-1b3453b >>> 6-28b5-4abd-84ce-0326b5468064, endpoint /130.199.185.193 died >>> ERROR [RMI TCP Connection(22)-130.199.185.194] 2012-04-18 = 22:18:26,488 StorageService.java (line 1607) Repair session = org.apache.cas >>> sandra.service.AntiEntropyService$RepairSession@4cc9e2bc failed. >>> java.util.concurrent.ExecutionException: java.lang.RuntimeException: = java.io.IOException: Problem during repair session manual-repai >>> r-43545b22-ffe8-4243-8a98-509bbfec9872, endpoint /130.199.185.195 = died >>> at = java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222) >>> at java.util.concurrent.FutureTask.get(FutureTask.java:83) >>> at = org.apache.cassandra.service.StorageService.forceTableRepair(StorageServic= e.java:1603) >>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native = Method) >>> at = sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:= 39) >>> at = sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorIm= pl.java:25) >>> at java.lang.reflect.Method.invoke(Method.java:597) >>> at = com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIn= trospector.java:93) >>> at = com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIn= trospector.java:27) >>> at = com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:2= 08) >>> at = com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120) >>> at = com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262) >>> at = com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanS= erverInterceptor.java:836) >>> at = com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761) >>> at = javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImp= l.java:1427) >>> at = javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl= .java:72) >>> at = javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIC= onnectionImpl.java:1265) >>> at = javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMICon= nectionImpl.java:1360) >>> at = javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.jav= a:788) >>> at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown = Source) >>> at = sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorIm= pl.java:25) >>> at java.lang.reflect.Method.invoke(Method.java:597) >>> at = sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:303) >>> at sun.rmi.transport.Transport$1.run(Transport.java:159) >>> at java.security.AccessController.doPrivileged(Native = Method) >>> at = sun.rmi.transport.Transport.serviceCall(Transport.java:155) >>> at = sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535) >>> at = sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.jav= a:790) >>> at = sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java= :649) >>> at = java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.= java:886) >>> at = java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java= :908) >>> at java.lang.Thread.run(Thread.java:662) >>> Caused by: java.lang.RuntimeException: java.io.IOException: Problem = during repair session manual-repair-43545b22-ffe8-4243-8a98-509b >>> bfec9872, endpoint /130.199.185.195 died >>> at = org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34) >>> at = java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) >>> at = java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) >>> at java.util.concurrent.FutureTask.run(FutureTask.java:138) >>> ... 3 more >>> Caused by: java.io.IOException: Problem during repair session = manual-repair-43545b22-ffe8-4243-8a98-509bbfec9872, endpoint /130.199. >>> 185.195 died >>> at = org.apache.cassandra.service.AntiEntropyService$RepairSession.failedNode(A= ntiEntropyService.java:723) >>> at = org.apache.cassandra.service.AntiEntropyService$RepairSession.convict(Anti= EntropyService.java:760) >>> at = org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:16= 5) >>> at = org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:538) >>> at = org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:57) >>> at = org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:157) >>> at = java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) >>> at = java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)= >>> at = java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150) >>> at = java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.acces= s$101(ScheduledThreadPoolExecutor.java:98) >>> at = java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPe= riodic(ScheduledThreadPoolExecutor.java:180) >>> at = java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(S= cheduledThreadPoolExecutor.java:204) >>> ... 3 more >>> INFO [FlushWriter:4] 2012-04-18 22:18:26,489 Memtable.java (line = 237) Writing Memtable-files.pandaid@988533693(1754325/140346000 se >>> rialized/live bytes, 70173 ops) >>> ERROR [AntiEntropySessions:8] 2012-04-18 22:18:26,488 = AntiEntropyService.java (line 691) Repair session = manual-repair-1b3453b6-28b5- >>> 4abd-84ce-0326b5468064 (on cfs [Ljava.lang.String;@5a44cc06, range = (113427455640312821154458202477256070484,0]) failed with the foll >>> owing error >>> java.io.IOException: Problem during repair session = manual-repair-1b3453b6-28b5-4abd-84ce-0326b5468064, endpoint = /130.199.185.193 die >>> d >>> at = org.apache.cassandra.service.AntiEntropyService$RepairSession.failedNode(A= ntiEntropyService.java:723) >>> at = org.apache.cassandra.service.AntiEntropyService$RepairSession.convict(Anti= EntropyService.java:760) >>> at = org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:16= 5) >>> at = org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:538) >>> at = org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:57) >>> at = org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:157) >>> at = java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) >>> at = java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)= >>> at = java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150) >>> at = java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.acces= s$101(ScheduledThreadPoolExecutor.java:98) >>> at = java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPe= riodic(ScheduledThreadPoolExecutor.java:180) >>> at = java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(S= cheduledThreadPoolExecutor.java:204) >>> at = java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.= java:886) >>> at = java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java= :908) >>> at java.lang.Thread.run(Thread.java:662) >>> INFO [ScheduledTasks:1] 2012-04-18 22:18:26,488 StatusLogger.java = (line 65) ReadStage 0 0 0 >>> ERROR [RMI TCP Connection(22)-130.199.185.194] 2012-04-18 = 22:18:26,491 StorageService.java (line 1607) Repair session = org.apache.cas >>> sandra.service.AntiEntropyService$RepairSession@41bb342c failed. >>> java.util.concurrent.ExecutionException: java.lang.RuntimeException: = java.io.IOException: Problem during repair session manual-repai >>> r-1b3453b6-28b5-4abd-84ce-0326b5468064, endpoint /130.199.185.193 = died >>> at = java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222) >>> at java.util.concurrent.FutureTask.get(FutureTask.java:83) >>> at = org.apache.cassandra.service.StorageService.forceTableRepair(StorageServic= e.java:1603) >>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native = Method) >>> at = sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:= 39) >>> at = sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorIm= pl.java:25) >>> at java.lang.reflect.Method.invoke(Method.java:597) >>> at = com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIn= trospector.java:93) >>> at = com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIn= trospector.java:27) >>> at = com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:2= 08) >>> at = com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120) >>> at = com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262) >>> at = com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanS= erverInterceptor.java:836) >>> at = com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761) >>> at = javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImp= l.java:1427) >>> at = javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl= .java:72) >>> at = javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIC= onnectionImpl.java:1265) >>> at = javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMICon= nectionImpl.java:1360) >>> at = javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.jav= a:788) >>> at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown = Source) >>> at = sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorIm= pl.java:25) >>> at java.lang.reflect.Method.invoke(Method.java:597) >>> at = sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:303) >>> at sun.rmi.transport.Transport$1.run(Transport.java:159) >>> at java.security.AccessController.doPrivileged(Native = Method) >>> at = sun.rmi.transport.Transport.serviceCall(Transport.java:155) >>> at = sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535) >>> at = sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.jav= a:790) >>> at = sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java= :649) >>> at = java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.= java:886) >>> at = java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java= :908) >>> at java.lang.Thread.run(Thread.java:662) >>> Caused by: java.lang.RuntimeException: java.io.IOException: Problem = during repair session manual-repair-1b3453b6-28b5-4abd-84ce-0326 >>> b5468064, endpoint /130.199.185.193 died >>> at = org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34) >>> at = java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) >>> at = java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) >>> at java.util.concurrent.FutureTask.run(FutureTask.java:138) >>> ... 3 more >>> Caused by: java.io.IOException: Problem during repair session = manual-repair-1b3453b6-28b5-4abd-84ce-0326b5468064, endpoint /130.199. >>> 185.193 died >>> at = org.apache.cassandra.service.AntiEntropyService$RepairSession.failedNode(A= ntiEntropyService.java:723) >>> at = org.apache.cassandra.service.AntiEntropyService$RepairSession.convict(Anti= EntropyService.java:760) >>> at = org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:16= 5) >>> at = org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:538) >>> at = org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:57) >>> at = org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:157) >>> at = java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) >>> at = java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)= >>>=20 >>>=20 >>>=20 >>> On 4/12/2012 10:03 PM, aaron morton wrote: >>>>=20 >>>> Look at the server side logs for errors.=20 >>>>=20 >>>> Cheers >>>>=20 >>>> ----------------- >>>> Aaron Morton >>>> Freelance Developer >>>> @aaronmorton >>>> http://www.thelastpickle.com >>>>=20 >>>> On 13/04/2012, at 11:47 AM, Maxim Potekhin wrote: >>>>=20 >>>>> Hello, >>>>>=20 >>>>> I'm doing compactions under 0.8.8. >>>>>=20 >>>>> Recently, I started seeing a stack trace like one below, and I = can't figure out what causes this to appear. >>>>> The cluster has been in operation for mode than half a year w/o = errors like this one. >>>>>=20 >>>>> Any help will be appreciated, >>>>> Thanks >>>>>=20 >>>>> Maxim >>>>>=20 >>>>>=20 >>>>> WARNING: Failed to check the connection: = java.net.SocketTimeoutException: Read timed out >>>>> Exception in thread "main" java.io.IOException: Repair command #1: = some repair session(s) failed (see log for details). >>>>> at = org.apache.cassandra.service.StorageService.forceTableRepair(StorageServic= e.java:1613) >>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native = Method) >>>>> at = sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:= 39) >>>>> at = sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorIm= pl.java:25) >>>>> at java.lang.reflect.Method.invoke(Method.java:597) >>>>> at = com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIn= trospector.java:93) >>>>> at = com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIn= trospector.java:27) >>>>> at = com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:2= 08) >>>>> at = com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120) >>>>> at = com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262) >>>>> at = com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanS= erverInterceptor.java:836) >>>>> at = com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761) >>>>> at = javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImp= l.java:1427) >>>>> at = javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl= .java:72) >>>>> at = javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIC= onnectionImpl.java:1265) >>>>> at = javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMICon= nectionImpl.java:1360) >>>>> at = javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.jav= a:788) >>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native = Method) >>>>> at = sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:= 39) >>>>> at = sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorIm= pl.java:25) >>>>> at java.lang.reflect.Method.invoke(Method.java:597) >>>>> at = sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:303) >>>>> at sun.rmi.transport.Transport$1.run(Transport.java:159) >>>>> at java.security.AccessController.doPrivileged(Native = Method) >>>>> at = sun.rmi.transport.Transport.serviceCall(Transport.java:155) >>>>> at = sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535) >>>>> at = sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.jav= a:790) >>>>> at = sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java= :649) >>>>> at = java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.= java:886) >>>>> at = java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java= :908) >>>>> at java.lang.Thread.run(Thread.java:662) >>>>>=20 >>>>=20 >>>=20 >>=20 >=20 --Apple-Mail=_7EBFD030-891E-4E3C-B63C-AE52E6C728B0 Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=iso-8859-1
1150 pending tasks, and is not
making = progress. 
Not all pending tasks reported by = nodetool compactionstats actually run. Once they get a chance to run the = files they were going to work on may have already been = compacted. 

Given that repair tests at double = the phi threshold, it may not make much = difference.

Did other nodes notice it was dead ? Was = there anything in the log that showed it was under duress (GC or dropped = message logs) ? 

Is the compaction a = consequence of repair ? (The streaming stage can result in compactions). = Or do you think the node is just behind on compactions = ? 

If you feel compaction is hurting the = node, consider setting concurrent_compactors in the yaml to = 2.

You can also isolate the node from updates = using nodetool disablegossip and disablerthrift , and the turn off the = IO limiter with nodetool setcompactionthroughput 0. =  
 
Hope that = helps. 
 
http://www.thelastpickle.com

On 20/04/2012, at 12:29 AM, Maxim Potekhin wrote:

=20 =20
Hello Aaron,

how should I go about fixing that? Also, after a repeated attempt to compact
it goes again into "building secondary index" with 1150 pending tasks, and is not
making progress. I suspected the disk system failure, but this needs to be confirmed.

So basically, do I need to tune the phi threshold up? Thing is, there was no heavy load
on the cluster at all.

Thanks

Maxim




On 4/19/2012 7:06 AM, aaron morton wrote:
At some point the gossip system on the node this log is from decided that 130.199.185.195 was DOWN. This was based = on how often the node was gossiping to the cluster. 

The active repair session was informed. And to avoid failing the job unnecessarily it tested that the errant nodes phi value was twice the configured phi_convict_threshold. It was and the repair was killed. 

Take a look at the logs on 130.199.185.195 and see = if anything was happening on the node at the same time. Could  be GC or an overloaded node (it would log about = dropped messages). 

Perhaps other nodes also saw 130.199.185.195 as = down? it only needed to be down for a few seconds. 

Hope that helps. 

-----------------
Aaron Morton
Freelance Developer
@aaronmorton

On 19/04/2012, at 2:25 PM, Maxim Potekhin = wrote:

Server log below. Mind you that all the nodes are still up -- even though reported as "dead" in this log.
What's going on here?

Thanks!

 INFO [GossipTasks:1] 2012-04-18 22:18:26,487 Gossiper.java (line 719) InetAddress /130.199.185.193 is now dead.
 INFO [ScheduledTasks:1] 2012-04-18 = 22:18:26,487 StatusLogger.java (line 50) Pool = Name           &nbs= p;        Active   = Pending   Blocked
ERROR [GossipTasks:1] 2012-04-18 22:18:26,488 AntiEntropyService.java (line 722) Problem during repair session manual-repair-1b3453b
6-28b5-4abd-84ce-0326b5468064, endpoint /130.199.185.193 died
ERROR [RMI TCP Connection(22)-130.199.185.194] 2012-04-18 22:18:26,488 StorageService.java (line 1607) Repair session org.apache.cas
= sandra.service.AntiEntropyService$RepairSession@4cc9e2bc failed.
java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.IOException: Problem during repair session manual-repai
r-43545b22-ffe8-4243-8a98-509bbfec9872, endpoint /130.199.185.195 died
        at = java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
        at = java.util.concurrent.FutureTask.get(FutureTask.java:83)
        at = org.apache.cassandra.service.StorageService.forceTableRepair(StorageServic= e.java:1603)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at = sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:= 39)
        at = sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorIm= pl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at = com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIn= trospector.java:93)
        at = com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIn= trospector.java:27)
        at = com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:2= 08)
        at = com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120)
        at = com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
        at = com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanS= erverInterceptor.java:836)
        at = com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
=         at = javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImp= l.java:1427)
        at = javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl= .java:72)
        at = javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIC= onnectionImpl.java:1265)
        at = javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMICon= nectionImpl.java:1360)
        at = javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.jav= a:788)
        at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
        at = sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorIm= pl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at = sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:303)
        at = sun.rmi.transport.Transport$1.run(Transport.java:159)
        at java.security.AccessController.doPrivileged(Native Method)
        at = sun.rmi.transport.Transport.serviceCall(Transport.java:155)
        at = sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)         at = sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.jav= a:790)
        at = sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java= :649)
        at = java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.= java:886)
        at = java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java= :908)
        at = java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.RuntimeException: java.io.IOException: Problem during repair session manual-repair-43545b22-ffe8-4243-8a98-509b
bfec9872, endpoint /130.199.185.195 died
        at = org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)         at = java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)         at = java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at = java.util.concurrent.FutureTask.run(FutureTask.java:138)
        ... 3 = more
Caused by: java.io.IOException: Problem during repair session manual-repair-43545b22-ffe8-4243-8a98-509bbfec9872, endpoint /130.199.
185.195 died
        at = org.apache.cassandra.service.AntiEntropyService$RepairSession.failedNode(A= ntiEntropyService.java:723)
        at = org.apache.cassandra.service.AntiEntropyService$RepairSession.convict(Anti= EntropyService.java:760)
        at = org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:16= 5)
        at = org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:538)
        at = org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:57)
        at = org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:157)
        at = java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)         at = java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)=
        at = java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
        at = java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.acces= s$101(ScheduledThreadPoolExecutor.java:98)
        at = java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPe= riodic(ScheduledThreadPoolExecutor.java:180)
        at = java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(S= cheduledThreadPoolExecutor.java:204)
        ... 3 = more
 INFO [FlushWriter:4] 2012-04-18 22:18:26,489 Memtable.java (line 237) Writing Memtable-files.pandaid@988533693(1754325/140346000 se
rialized/live bytes, 70173 ops)
ERROR [AntiEntropySessions:8] 2012-04-18 22:18:26,488 AntiEntropyService.java (line 691) Repair session manual-repair-1b3453b6-28b5-
4abd-84ce-0326b5468064 (on cfs [Ljava.lang.String;@5a44cc06, range (113427455640312821154458202477256070484,0]) failed with the foll
owing error
java.io.IOException: Problem during repair session manual-repair-1b3453b6-28b5-4abd-84ce-0326b5468064, endpoint /130.199.185.193 die
d
        at = org.apache.cassandra.service.AntiEntropyService$RepairSession.failedNode(A= ntiEntropyService.java:723)
        at = org.apache.cassandra.service.AntiEntropyService$RepairSession.convict(Anti= EntropyService.java:760)
        at = org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:16= 5)
        at = org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:538)
        at = org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:57)
        at = org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:157)
        at = java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)         at = java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)=
        at = java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
        at = java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.acces= s$101(ScheduledThreadPoolExecutor.java:98)
        at = java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPe= riodic(ScheduledThreadPoolExecutor.java:180)
        at = java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(S= cheduledThreadPoolExecutor.java:204)
        at = java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.= java:886)
        at = java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java= :908)
        at = java.lang.Thread.run(Thread.java:662)
 INFO [ScheduledTasks:1] 2012-04-18 = 22:18:26,488 StatusLogger.java (line 65) = ReadStage           = ;            &= nbsp; 0         0         = 0
ERROR [RMI TCP Connection(22)-130.199.185.194] 2012-04-18 22:18:26,491 StorageService.java (line 1607) Repair session org.apache.cas
= sandra.service.AntiEntropyService$RepairSession@41bb342c failed.
java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.IOException: Problem during repair session manual-repai
r-1b3453b6-28b5-4abd-84ce-0326b5468064, endpoint /130.199.185.193 died
        at = java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
        at = java.util.concurrent.FutureTask.get(FutureTask.java:83)
        at = org.apache.cassandra.service.StorageService.forceTableRepair(StorageServic= e.java:1603)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at = sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:= 39)
        at = sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorIm= pl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at = com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIn= trospector.java:93)
        at = com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIn= trospector.java:27)
        at = com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:2= 08)
        at = com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120)
        at = com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
        at = com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanS= erverInterceptor.java:836)
        at = com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
=         at = javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImp= l.java:1427)
        at = javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl= .java:72)
        at = javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIC= onnectionImpl.java:1265)
        at = javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMICon= nectionImpl.java:1360)
        at = javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.jav= a:788)
        at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
        at = sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorIm= pl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at = sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:303)
        at = sun.rmi.transport.Transport$1.run(Transport.java:159)
        at java.security.AccessController.doPrivileged(Native Method)
        at = sun.rmi.transport.Transport.serviceCall(Transport.java:155)
        at = sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535)         at = sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.jav= a:790)
        at = sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java= :649)
        at = java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.= java:886)
        at = java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java= :908)
        at = java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.RuntimeException: java.io.IOException: Problem during repair session manual-repair-1b3453b6-28b5-4abd-84ce-0326
b5468064, endpoint /130.199.185.193 died
        at = org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)         at = java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)         at = java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at = java.util.concurrent.FutureTask.run(FutureTask.java:138)
        ... 3 = more
Caused by: java.io.IOException: Problem during repair session manual-repair-1b3453b6-28b5-4abd-84ce-0326b5468064, endpoint /130.199.
185.193 died
        at = org.apache.cassandra.service.AntiEntropyService$RepairSession.failedNode(A= ntiEntropyService.java:723)
        at = org.apache.cassandra.service.AntiEntropyService$RepairSession.convict(Anti= EntropyService.java:760)
        at = org.apache.cassandra.gms.FailureDetector.interpret(FailureDetector.java:16= 5)
        at = org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:538)
        at = org.apache.cassandra.gms.Gossiper.access$700(Gossiper.java:57)
        at = org.apache.cassandra.gms.Gossiper$GossipTask.run(Gossiper.java:157)
        at = java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)         at = java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)=



On 4/12/2012 10:03 PM, aaron morton wrote:
Look at the server side logs for errors. 

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton

On 13/04/2012, at 11:47 AM, Maxim Potekhin wrote:

Hello,

I'm doing compactions under 0.8.8.

Recently, I started seeing a stack trace like one below, and I can't figure out what causes this to appear.
The cluster has been in operation for mode than half a year w/o errors like this = one.

Any help will be appreciated,
Thanks

Maxim


WARNING: Failed to check the connection: java.net.SocketTimeoutException: Read timed out
Exception in thread "main" java.io.IOException: Repair command #1: some repair session(s) failed (see log for details).
=        at = org.apache.cassandra.service.StorageService.forceTableRepair(StorageServic= e.java:1613)
=        at = sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
=        at = sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:= 39)
=        at = sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorIm= pl.java:25)
=        at = java.lang.reflect.Method.invoke(Method.java:597)
=        at = com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIn= trospector.java:93)
=        at = com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIn= trospector.java:27)
=        at = com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:2= 08)
=        at = com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:120)
=        at = com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:262)
=        at = com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanS= erverInterceptor.java:836)
=        at = com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:761)
= =        at = javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImp= l.java:1427)
=        at = javax.management.remote.rmi.RMIConnectionImpl.access$200(RMIConnectionImpl= .java:72)
=        at = javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIC= onnectionImpl.java:1265)
=        at = javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMICon= nectionImpl.java:1360)
=        at = javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.jav= a:788)
=        at = sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
=        at = sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:= 39)
=        at = sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorIm= pl.java:25)
=        at = java.lang.reflect.Method.invoke(Method.java:597)
=        at = sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:303)
=        at = sun.rmi.transport.Transport$1.run(Transport.java:159)
=        at = java.security.AccessController.doPrivileged(Native Method)
=        at = sun.rmi.transport.Transport.serviceCall(Transport.java:155)
=        at = sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:535) =        at = sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.jav= a:790)
=        at = sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java= :649)
=        at = java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.= java:886)
=        at = java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java= :908)
=        at java.lang.Thread.run(Thread.java:662)






= --Apple-Mail=_7EBFD030-891E-4E3C-B63C-AE52E6C728B0--