From: Jonathan Hsieh <jon@cloudera.com>
Date: Tue, 3 Dec 2013 14:03:14 -0800
Subject: Re: [Shadow Regions / Read Replicas] Wal per region?
To: dev@hbase.apache.org

On Tue, Dec 3, 2013 at 11:42 AM, Enis Söztutar wrote:

> On Mon, Dec 2, 2013 at 10:20 PM, Jonathan Hsieh wrote:
>
> > Devaraj:
> > > Jonathan Hsieh, WAL per region (WALpr) would give you the locality (and
> > > hence HDFS short circuit) of reads if you were to couple it with the
> > > favored nodes. The cost is of course more WAL files... In the current
> > > situation (no WALpr) it would create quite some cross-machine traffic, no?
> >
> > I think we all agree that a WAL per region isn't efficient in today's
> > spinning-hard-drive world, where we are limited to a relatively low budget
> > of seeks (though it may be better in the future with SSDs).
>
> WALpr makes sense in a fully-SSD world and if HDFS had journaling for
> writes. I don't think anybody is working on this yet.

What do you mean by journaling for writes? Do you mean having sync
operations update the file length at the NN on every call?
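To make sure I'm asking about the right thing, here is a minimal sketch of
the behavior I mean (the path is made up; the point is just that hflush()ed
bytes are readable from the datanodes, while the length the NN reports via
getFileStatus() can lag until the block completes or the file is closed):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HflushLengthSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path wal = new Path("/tmp/wal-sketch");  // made-up path

    FSDataOutputStream out = fs.create(wal);
    out.writeBytes("edit-1\n");
    // hflush() pushes the bytes out so a tailing reader can fetch them from
    // the datanodes, but it does not report the new length of the
    // under-construction block to the NN.
    out.hflush();

    // A tailing reader that trusts only NN metadata may still see a stale
    // length here, even though the bytes above are already readable.
    System.out.println("Length per NN: " + fs.getFileStatus(wal).getLen());

    out.writeBytes("edit-2\n");
    // hsync() additionally asks the datanodes to sync to disk; it still is
    // not a per-call length update at the NN.
    out.hsync();

    out.close();  // only now does the NN record the final length
    System.out.println("Length after close: " + fs.getFileStatus(wal).getLen());
    fs.close();
  }
}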
> Full SSD clusters are already in place (Pinterest, for example), so I
> think having WALpr as a pluggable implementation makes sense. HBase
> should work with both WAL-per-regionserver (or multi-WAL) and
> WAL-per-region.

I agree here.

> > With this in mind, I am actually making the case that we would group all
> > the regions from RS-A onto the same set of preferred region servers. This
> > way we only need to have one or two other RSs tailing the RS.
> >
> > So for example, if regions X, Y, and Z were on RS-A and its HLog, the
> > shadow region memstores for X, Y, and Z would be assigned to the same one
> > or two other RSs. Ideally this would be where the HLog file replicas have
> > locality (helped by favored nodes/block affinity). Doing this, we hold the
> > number of readers on the active HLogs to a constant number and do not add
> > any new cross-machine traffic (though tailing currently has costs on the NN).
> >
> > One inefficiency we have is that with a single log per RS, we end up
> > reading log entries for tables that may not have the shadow feature
> > enabled. However, with HBase multi-WAL coming, one strategy is to shard
> > WALs to a number on the order of the number of disks on a machine (12-24
> > these days). I think a WAL per namespace (which could also be used to get
> > a WAL per table) would make sense. Sharding the HLog this way would reduce
> > the amount of irrelevant log entries read in a log-tailing scheme. It would
> > have the added benefit of reducing log-splitting work, which reduces MTTR,
> > and of allowing recovery priorities if the primaries and shadows also go
> > down. (This is a generalization of the idea of separating META out into
> > its own log.)
> >
> > Jon.
> >
> > --
> > // Jonathan Hsieh (shay)
> > // Software Engineer, Cloudera
> > // jon@cloudera.com

-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com
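P.S. To make the WAL-per-namespace sharding above a bit more concrete, here
is a rough sketch of what a pluggable grouping strategy could look like.
The interface and class names below are made up for illustration; this is
not an existing HBase API, just the shape of the idea:

import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical strategy for picking which WAL a region's edits go to.
 * A region server owns a small, fixed set of WALs; a shadow RS then only
 * tails the WALs for the groups it actually hosts.
 */
interface WALGroupingStrategy {
  /** Stable group name for a region; one WAL per group. */
  String groupFor(String namespace, String table, String encodedRegionName);
}

/** Everything in one group: today's WAL-per-regionserver behavior. */
class SingleGroupStrategy implements WALGroupingStrategy {
  public String groupFor(String namespace, String table, String encodedRegionName) {
    return "default";
  }
}

/**
 * One WAL per namespace: a shadow RS tailing the "shadowed" namespace never
 * reads edits for tables without the feature, and log splitting is
 * partitioned the same way.
 */
class NamespaceGroupingStrategy implements WALGroupingStrategy {
  public String groupFor(String namespace, String table, String encodedRegionName) {
    return "ns-" + namespace;
  }
}

/** Toy, non-thread-safe registry standing in for real per-group WAL writers. */
class WALRegistry {
  private final Map<String, StringBuilder> wals = new HashMap<String, StringBuilder>();
  private final WALGroupingStrategy strategy;

  WALRegistry(WALGroupingStrategy strategy) { this.strategy = strategy; }

  void append(String namespace, String table, String region, String edit) {
    String group = strategy.groupFor(namespace, table, region);
    StringBuilder wal = wals.get(group);
    if (wal == null) {
      wal = new StringBuilder();
      wals.put(group, wal);
    }
    wal.append(region).append(':').append(edit).append('\n');
  }

  int walCount() { return wals.size(); }
}

public class WalGroupingSketch {
  public static void main(String[] args) {
    WALRegistry byNamespace = new WALRegistry(new NamespaceGroupingStrategy());
    byNamespace.append("shadowed", "t1", "region-x", "put k1");
    byNamespace.append("default", "t2", "region-y", "put k2");
    // Two namespaces -> two WALs; a shadow RS only needs to tail "ns-shadowed".
    System.out.println("WAL count with namespace grouping: " + byNamespace.walCount());
  }
}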