Date: Thu, 19 May 2011 11:46:15 -0700
Subject: hbase master retries to RS/DN
From: Jack Levin
To: user@hbase.apache.org
Cc: sysops@imageshack.us

Hello, we have a situation where, when an RS/DN crashes hard, the master is very slow to recover. We notice that it waits on these log lines:

2011-05-19 11:20:57,766 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.103.7.22:50020. Already tried 0 time(s).
2011-05-19 11:20:58,767 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.103.7.22:50020. Already tried 1 time(s).
2011-05-19 11:20:59,768 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.103.7.22:50020. Already tried 2 time(s).
2011-05-19 11:21:00,768 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.103.7.22:50020. Already tried 3 time(s).
2011-05-19 11:21:01,769 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.103.7.22:50020. Already tried 4 time(s).
2011-05-19 11:21:02,769 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.103.7.22:50020. Already tried 5 time(s).
2011-05-19 11:21:03,770 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.103.7.22:50020. Already tried 6 time(s).
2011-05-19 11:21:04,771 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.103.7.22:50020. Already tried 7 time(s).
2011-05-19 11:21:05,771 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.103.7.22:50020. Already tried 8 time(s).
2011-05-19 11:21:06,772 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.103.7.22:50020. Already tried 9 time(s).

This set repeats multiple times for log splits. So I looked around and set this config:

hbase.client.retries.number = 2

Maximum retries. Used as maximum for all retryable operations such as fetching of the root region from root region server, getting a cell's value, starting a row update, etc. Default: 10.

Unfortunately, the next time a server died, it made no difference. Is this a known issue for 0.89? If so, was it resolved in 0.90.2?

-Jack
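
To put a number on how long the master spends in each of the retry batches quoted above, the log lines can be grouped and timed with a small script. This is only a minimal sketch: it assumes the master log uses exactly the line format shown in this message, and the helper names (parse_retry, retry_batches) are hypothetical, not part of HBase or Hadoop. It measures the delay; it does not change any retry setting.

```python
import re
from datetime import datetime

# Matches the Hadoop IPC client retry lines quoted in this message.
# Assumption: the master log uses exactly this layout.
RETRY_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3}) INFO "
    r"org\.apache\.hadoop\.ipc\.Client: Retrying connect to server: "
    r"(\S+?)\. Already tried (\d+) time\(s\)\."
)

def parse_retry(line):
    """Return (timestamp, server, attempt) for a retry line, else None."""
    m = RETRY_RE.match(line.strip())
    if m is None:
        return None
    stamp = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    stamp = stamp.replace(microsecond=int(m.group(2)) * 1000)
    return stamp, m.group(3), int(m.group(4))

def retry_batches(lines):
    """Group retry lines into batches.

    A new batch starts when the attempt counter resets to 0 or the
    target server changes.
    """
    batches = []
    for line in lines:
        parsed = parse_retry(line)
        if parsed is None:
            continue  # not a retry line
        stamp, server, attempt = parsed
        if attempt == 0 or not batches or batches[-1]["server"] != server:
            batches.append(
                {"server": server, "first": stamp, "last": stamp, "attempts": 0}
            )
        batch = batches[-1]
        batch["last"] = stamp
        batch["attempts"] += 1
    return batches

# The ten lines quoted in this message, verbatim.
SAMPLE = [
    "2011-05-19 11:20:57,766 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.103.7.22:50020. Already tried 0 time(s).",
    "2011-05-19 11:20:58,767 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.103.7.22:50020. Already tried 1 time(s).",
    "2011-05-19 11:20:59,768 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.103.7.22:50020. Already tried 2 time(s).",
    "2011-05-19 11:21:00,768 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.103.7.22:50020. Already tried 3 time(s).",
    "2011-05-19 11:21:01,769 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.103.7.22:50020. Already tried 4 time(s).",
    "2011-05-19 11:21:02,769 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.103.7.22:50020. Already tried 5 time(s).",
    "2011-05-19 11:21:03,770 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.103.7.22:50020. Already tried 6 time(s).",
    "2011-05-19 11:21:04,771 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.103.7.22:50020. Already tried 7 time(s).",
    "2011-05-19 11:21:05,771 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.103.7.22:50020. Already tried 8 time(s).",
    "2011-05-19 11:21:06,772 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.103.7.22:50020. Already tried 9 time(s).",
]

if __name__ == "__main__":
    for batch in retry_batches(SAMPLE):
        elapsed = (batch["last"] - batch["first"]).total_seconds()
        print(batch["server"], batch["attempts"], "attempts,", elapsed, "s")
```

Feeding the whole master log through retry_batches instead of SAMPLE would give one entry per repeated set, so the per-batch delay can be multiplied out across the log splits.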