From: Jonathan Hsieh
Date: Mon, 2 Dec 2013 22:00:44 -0800
Subject: Re: [Shadow Regions / Read Replicas] Block Affinity
To: dev@hbase.apache.org

> Enis:
> I was trying to refer to not having co-location constraints for secondary
> replicas whose primaries are hosted by the same RS. For example, if
> R1(replica=0) and R2(replica=0) are hosted on RS1, then R1(replica=1) and
> R2(replica=1) can be hosted by RS2 and RS3 respectively. This can
> definitely use the hdfs block affinity work though.

This particular example doesn't have enough detail to tease out the different ideal situations. Hopefully this will help:

We have RS-A hosting regions X and Y. With affinity groups, let's say RS-A's logs are written to RS-A, RS-L, and RS-M. Let's also say that X is written to RS-A, RS-B, and RS-C, and Y to RS-A, RS-D, and RS-E.

>> Jon:
>> However, I don't think we get into a situation where all RSs must read
>> all other RSs' logs – we only need the shadow RSs to read the primary
>> RS's log.

> Enis:
> I am assuming a random distribution of secondary regions per above. In
> this case, for replication=2, a region server will have half of its
> regions in primary mode and the other half in secondary mode. For all
> the regions in secondary mode, it has to tail the logs of the RSs where
> the primaries are hosted. However, since there is no co-location
> guarantee, the primaries are also randomly distributed.
> For n secondary regions and m region servers, you will have to tail the
> logs of most of the RSs if n > m, with high probability (I do not have
> the smarts to calculate the exact probability).

For high-availability stale-read replicas (read replicas), it seems best to assign the secondary regions to the RSs where the HFiles are hosted. This approach would assign the replicas like this (this is the "random distribution of secondary regions"):

* X on RS-A (rep=0), RS-B (rep=1), and RS-C (rep=2); and
* Y on RS-A (rep=0), RS-D (rep=1), and RS-E (rep=2).

For the most efficient consistent read-recovery (shadow regions/memstores), it would make sense to assign them to the RSs where the HLogs are local. This approach would assign shadow regions for regions X and Y to RS-L and RS-M.

A simple solution, optimal for both read replicas and shadow regions, would be to assign the regions and the HLog to the same set of machines, so that the RSs hosting the logs and regions X and Y are the same machines -- let's say RS-A, RS-H, and RS-I. This has some non-optimal balancing ramifications upon machine failure -- the work of RS-A would be split between only RS-H and RS-I.

A more complex solution for both would be to choose machines for the purpose they are best suited for: read replicas are hosted on the RSs holding their HFiles, and shadow region memstores on the HLogs' RSs. Promotion becomes a more complicated dance: upon RS-A's failure, we have the log-tailing shadow regions catch up and then flush the affected memstores to the appropriate hdfs affinity group/favored nodes. So the shadow memstore for region X would flush its hfile to A, B, and C, and region Y's to A, D, and E. Then the read replicas would be promoted (close secondary, open as primary) based on the regions'/hfiles' affinity group. This feels like an optimization done on the 2nd or 3rd rev.
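As a back-of-the-envelope check on Enis's log-tailing claim above, here is a small sketch (not from the thread; the cluster sizes are illustrative assumptions). With n secondary regions whose primaries are placed uniformly at random across m region servers, the expected number of distinct primary RSs (and hence distinct logs to tail) is m * (1 - (1 - 1/m)^n):

```python
def expected_logs_tailed(n_secondaries, m_servers):
    """Expected number of distinct primary RSs whose logs must be tailed,
    assuming each secondary's primary lands uniformly at random on one of
    m_servers. Standard occupancy formula: m * (1 - (1 - 1/m)**n)."""
    return m_servers * (1.0 - (1.0 - 1.0 / m_servers) ** n_secondaries)

if __name__ == "__main__":
    m = 20  # hypothetical number of region servers
    for n in (10, 20, 40, 100):
        frac = expected_logs_tailed(n, m) / m
        print(f"n={n:3d} secondaries, m={m} RSs -> tailing ~{frac:.0%} of all logs")
```

For n = m = 20 this already works out to roughly two thirds of all logs, and for n = 5m it is essentially all of them, which is consistent with the "most of the RSs if n > m" intuition, and with why HLog/HFile-aware placement (rather than random placement) matters here.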
Jon

--
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com