Mailing-List: contact user-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@hbase.apache.org
MIME-Version: 1.0
Sender: jdcryans@gmail.com
In-Reply-To: <BANLkTi=SqiscTDND0_pZRm00YpV5YoDGgQ@mail.gmail.com>
References: <BANLkTi=SqiscTDND0_pZRm00YpV5YoDGgQ@mail.gmail.com>
Date: Thu, 19 May 2011 14:22:04 -0700
Message-ID: <BANLkTin3SPjEhz9ry6eyMoAdpLfS+ARMFw@mail.gmail.com>
Subject: Re: hbase master retries to RS/DN
From: Jean-Daniel Cryans <jdcryans@apache.org>
To: user@hbase.apache.org
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

The config and the retries you pasted are unrelated.

The former (hbase.client.retries.number) controls the number of retries
in the HBase client, for example when regions are moving and the client
must query .META. or -ROOT- again.

The latter comes from the Hadoop RPC client's connection retries, not
from HBase. Looking at the code, the config is
ipc.client.connect.max.retries, from
https://github.com/apache/hadoop/blob/branch-0.20/src/core/org/apache/hadoop/ipc/Client.java#L631
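
For illustration only, lowering ipc.client.connect.max.retries might look
like the fragment below. This is a sketch, not a tested recommendation:
the value 2 is arbitrary, and which file the master's Hadoop client
actually picks it up from (hbase-site.xml or core-site.xml on the
master's classpath) is an assumption to verify for your deployment. The
default in that code is 10, which matches the ten attempts (0 through 9)
in the quoted log.

```xml
<!-- Assumption: placed in a config file that the HBase master's
     Hadoop RPC client loads; verify for your deployment. -->
<property>
  <name>ipc.client.connect.max.retries</name>
  <!-- Illustrative value only; the default is 10. -->
  <value>2</value>
</property>
```

A lower value makes the client give up on a dead server sooner, at the
cost of being less tolerant of brief network hiccups.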

J-D

On Thu, May 19, 2011 at 11:46 AM, Jack Levin <magnito@gmail.com> wrote:
> Hello, we have a situation where, when an RS/DN crashes hard, the
> master is very slow to recover. We notice that it waits on these log
> lines:
> 2011-05-19 11:20:57,766 INFO org.apache.hadoop.ipc.Client: Retrying
> connect to server: /10.103.7.22:50020. Already tried 0 time(s).
> 2011-05-19 11:20:58,767 INFO org.apache.hadoop.ipc.Client: Retrying
> connect to server: /10.103.7.22:50020. Already tried 1 time(s).
> 2011-05-19 11:20:59,768 INFO org.apache.hadoop.ipc.Client: Retrying
> connect to server: /10.103.7.22:50020. Already tried 2 time(s).
> 2011-05-19 11:21:00,768 INFO org.apache.hadoop.ipc.Client: Retrying
> connect to server: /10.103.7.22:50020. Already tried 3 time(s).
> 2011-05-19 11:21:01,769 INFO org.apache.hadoop.ipc.Client: Retrying
> connect to server: /10.103.7.22:50020. Already tried 4 time(s).
> 2011-05-19 11:21:02,769 INFO org.apache.hadoop.ipc.Client: Retrying
> connect to server: /10.103.7.22:50020. Already tried 5 time(s).
> 2011-05-19 11:21:03,770 INFO org.apache.hadoop.ipc.Client: Retrying
> connect to server: /10.103.7.22:50020. Already tried 6 time(s).
> 2011-05-19 11:21:04,771 INFO org.apache.hadoop.ipc.Client: Retrying
> connect to server: /10.103.7.22:50020. Already tried 7 time(s).
> 2011-05-19 11:21:05,771 INFO org.apache.hadoop.ipc.Client: Retrying
> connect to server: /10.103.7.22:50020. Already tried 8 time(s).
> 2011-05-19 11:21:06,772 INFO org.apache.hadoop.ipc.Client: Retrying
> connect to server: /10.103.7.22:50020. Already tried 9 time(s).
>
> This set repeats multiple times for log splits. So I looked around
> and set this config to be:
>
>  <property>
>    <name>hbase.client.retries.number</name>
>    <value>2</value>
>    <description>Maximum retries.  Used as maximum for all retryable
>    operations such as fetching of the root region from root region
>    server, getting a cell's value, starting a row update, etc.
>    Default: 10.
>    </description>
>  </property>
>
> Unfortunately, the next time a server died, it made no difference. Is
> this a known issue for 0.89? If so, was it resolved in 0.90.2?
>
> -Jack
>