From: Dane Miller
To: Wei Zhu
Cc: user@cassandra.apache.org
Date: Wed, 13 Mar 2013 16:28:01 -0700
Subject: Re: repair hangs
In-Reply-To: <1363203547.24795.GenericBBA@web160905.mail.bf1.yahoo.com>

On Wed, Mar 13, 2013 at 12:39 PM, Wei Zhu wrote:
> My guess would be there is some exception during the repair and your session is aborted.
> Here is the code of doing repair:
>
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/AntiEntropyService.java
>
> looking for
>
> logger.info
>
> Compare that with your log file, it should give you a rough idea in which stage repaired died.

Thanks for the link to the source. That's a little hard to grok, but your suggestion to examine the logs more thoroughly was helpful. I was able to determine that repair hung due to connection errors during streaming. I'll include log snippets below, but this leads me to other more important questions...

1. is this a nodetool bug? is there any way to propagate the java.io.IOException back to nodetool?
2. network problems on EC2, I'm shocked! are there recommended network settings for EC2?

Dane

Here are the relevant logs showing (A) repair progress, and (B) java.io.IOExceptions

(A) repair progress

INFO [Thread-5314] 2013-03-11 23:29:28,866 StorageService.java (line 2364) Starting repair command #9, repairing 1 ranges for keyspace OpsCenter
INFO [AntiEntropySessions:13] 2013-03-11 23:29:28,867 AntiEntropyService.java (line 652) [repair #84e86020-8aa3-11e2-abb2-17112e360b9a] new session: will sync /10.34.37.195, /10.82.233.59 on range (0,28356863910078205288614550619314017621] for OpsCenter.[events, rollups60, settings, pdps, rollups86400, events_timeline, rollups300, rollups7200]
INFO [Thread-5320] 2013-03-11 23:29:29,198 AntiEntropyService.java (line 765) [repair #84e86020-8aa3-11e2-abb2-17112e360b9a] events is fully synced (7 remaining column family to sync for this session)
INFO [AntiEntropyStage:1] 2013-03-11 23:38:02,198 AntiEntropyService.java (line 765) [repair #84e86020-8aa3-11e2-abb2-17112e360b9a] settings is fully synced (6 remaining column family to sync for this session)
INFO [AntiEntropyStage:1] 2013-03-11 23:38:02,617 AntiEntropyService.java (line 765) [repair #84e86020-8aa3-11e2-abb2-17112e360b9a] pdps is fully synced (5 remaining column family to sync for this session)
INFO [Streaming to /10.82.233.59:34] 2013-03-11 23:38:12,491 AntiEntropyService.java (line 765) [repair #84e86020-8aa3-11e2-abb2-17112e360b9a] rollups86400 is fully synced (4 remaining column family to sync for this session)
INFO [Streaming to /10.82.233.59:36] 2013-03-11 23:39:55,886 AntiEntropyService.java (line 765) [repair #84e86020-8aa3-11e2-abb2-17112e360b9a] rollups7200 is fully synced (3 remaining column family to sync for this session)

(B) java.io.IOException

# grep -A1 ERROR /var/log/cassandra/system.log.2
ERROR [Streaming to /10.82.233.59:34] 2013-03-11 23:38:12,654 CassandraDaemon.java (line 132) Exception in thread Thread[Streaming to /10.82.233.59:34,5,main]
java.lang.RuntimeException: java.io.IOException: Connection reset by peer
--
ERROR [Streaming to /10.82.233.59:35] 2013-03-11 23:38:12,692 CassandraDaemon.java (line 132) Exception in thread Thread[Streaming to /10.82.233.59:35,5,main]
java.lang.RuntimeException: java.io.IOException: Broken pipe
--
ERROR [Streaming to /10.82.233.59:36] 2013-03-11 23:39:55,932 CassandraDaemon.java (line 132) Exception in thread Thread[Streaming to /10.82.233.59:36,5,main]
java.lang.RuntimeException: java.io.IOException: Broken pipe
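
Since nodetool itself gave no indication of the failure, the cross-check between (A) and (B) had to be done by hand. A rough sketch of how that could be scripted follows. It is not part of Cassandra or nodetool: the function name summarize_repair and the three regular expressions are hypothetical, and the patterns are inferred only from the log lines quoted above, so other Cassandra versions may word these messages differently.

```python
import re

# All three patterns are assumptions derived from the quoted log lines only.
NEW_SESSION = re.compile(
    r"\[repair #(?P<sid>[0-9a-f-]+)\] new session: .* for "
    r"(?P<ks>\w+)\.\[(?P<cfs>[^\]]*)\]"
)
SYNCED = re.compile(r"\[repair #(?P<sid>[0-9a-f-]+)\] (?P<cf>\w+) is fully synced")
STREAM_ERROR = re.compile(r"ERROR \[Streaming to (?P<peer>/[\d.]+):\d+\]")


def summarize_repair(lines):
    """Return (sessions, stream_errors) for an iterable of log lines.

    sessions maps a repair session id to its keyspace, the column families
    reported as fully synced (in log order), and those still pending.
    stream_errors lists the peer address of every ERROR line logged by a
    "Streaming to" thread.
    """
    sessions = {}
    stream_errors = []
    for line in lines:
        m = NEW_SESSION.search(line)
        if m:
            cfs = [cf.strip() for cf in m.group("cfs").split(",") if cf.strip()]
            sessions[m.group("sid")] = {
                "keyspace": m.group("ks"),
                "synced": [],
                "pending": set(cfs),
            }
            continue
        m = SYNCED.search(line)
        if m and m.group("sid") in sessions:
            session = sessions[m.group("sid")]
            session["synced"].append(m.group("cf"))
            session["pending"].discard(m.group("cf"))
            continue
        m = STREAM_ERROR.search(line)
        if m:
            stream_errors.append(m.group("peer"))
    return sessions, stream_errors
```

Passing an open log file (for example /var/log/cassandra/system.log.2) as lines should work, since file objects iterate line by line. A session that still has pending column families while stream_errors is non-empty would match the hang described above, but this only reports what the log says; it does not make nodetool see the exception.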