From: Stack
Date: Thu, 27 Oct 2016 15:01:54 -0700
Subject: Re: Re: Re: Re: What way to improve MTTR other than DLR(distributed log replay)
To: HBase Dev List
Cc: hbase-user

On Fri, Oct 21, 2016 at 3:24 PM, Enis Söztutar wrote:

> A bit late, but let me give my perspective. This can also be moved to jira
> or dev@ I think.
>
> DLR was nice and had pretty good gains for MTTR. However, dealing with the
> sequence ids, onlining regions, etc. and the replay paths proved to be too
> difficult in practice. I think the way forward would be to not bring DLR
> back, but actually fix long-standing log split problems.
>
> The main gain in DLR is that we do not create lots and lots of tiny files,
> but instead rely on the regular region flushes to flush bigger files. This
> also helps with handling requests coming from different log files etc. The
> only gain that I can think of that you get with DLR, but not with log
> split, is the online enabling of writes while the recovery is going on.
> However, I think it is not worth having DLR just for this feature.
>

And not having to write intermediary files, as you note at the start of
your paragraph.

> Now, what are the problems with Log Split, you ask. The problems are
>   - we create a lot of tiny files
>   - these tiny files are replayed sequentially when the region is assigned
>   - the region has to replay and flush all data sequentially coming from
>   all these tiny files.
>

The longest pole in MTTR used to be noticing the RS had gone away in the
first place. Let's not forget to add this to our list.

> In terms of IO, we pay the cost of reading the original WAL files, and
> writing the same amount of data into many small files, where the NN
> overhead is huge. Then for every region, we serially sort the data by
> re-reading the tiny WAL files (recovered edits), sorting them in memory,
> and flushing the data. Which means we do two times the reads and writes
> that we should do otherwise.
>
> The way to solve our log split bottlenecks is re-reading the Bigtable
> paper and implementing the WAL recovery as described there:
>   - Implement an HFile format that can contain data from multiple regions.
>   Something like a concatenated HFile format where each region has its own
>   section, with its own sequence id, etc.
>   - Implement links to these files where a link can refer to this data.
>   This is very similar to our ReferenceFile concept.
>   - In each log splitter task, instead of generating tiny WAL files that
>   are recovered edits, we instead buffer up in memory and do a sort (this
>   is the same sort as inserting into the memstore) per region. A WAL is
>   ~100 MB on average, so it should not be a big problem to buffer this up.
>

Need to be able to spill. There will be anomalies.

>   At the end of the WAL split task, write an hfile containing data from
>   all the regions as described above. Also do a multi NN request to create
>   links in regions to refer to these files (not sure whether NN has a
>   batch RPC call or not).
>

It does not.
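
As an illustration of the buffer-and-sort step sketched above, here is a
minimal model of one splitter task, written against plain JDK collections
rather than the real HBase WAL/HFile APIs; Edit, readWal(), regionFor() and
writeSection() are invented stand-ins, not actual HBase code.

import java.util.*;

/**
 * Hypothetical model of one log-splitter task under the proposed scheme:
 * read one ~100 MB WAL, bucket edits per region, keep each bucket sorted
 * (the same sort a memstore insert does), then emit one concatenated
 * output file with a section per region.
 */
public class WalSplitSketch {

    /** A single WAL edit: row key, sequence id, value (illustrative). */
    record Edit(byte[] row, long seqId, byte[] value) {}

    /** Row order first, then sequence id; seq ids are unique per region,
     *  so no two distinct edits compare equal and none are dropped. */
    static final Comparator<Edit> EDIT_ORDER = (a, b) -> {
        int c = Arrays.compare(a.row(), b.row());
        return c != 0 ? c : Long.compare(a.seqId(), b.seqId());
    };

    public static void main(String[] args) {
        // Per-region sorted buffers. A real implementation needs the
        // spill-to-disk path noted above for WALs that outgrow memory.
        Map<String, NavigableSet<Edit>> buffers = new TreeMap<>();

        for (Edit e : readWal()) {                     // one read of the WAL
            buffers.computeIfAbsent(regionFor(e.row()),
                                    r -> new TreeSet<>(EDIT_ORDER))
                   .add(e);                            // sorted on insert
        }

        // One output file, one section per region: one write, already
        // sorted, so the region can serve it at open time with no replay.
        for (Map.Entry<String, NavigableSet<Edit>> s : buffers.entrySet()) {
            writeSection(s.getKey(), s.getValue());
        }
        // ...then one link (reference) per region pointing at its section.
    }

    static List<Edit> readWal() { return List.of(); }  // stub WAL reader
    static String regionFor(byte[] row) {              // stub region lookup
        return "region-" + (row.length % 2);
    }
    static void writeSection(String region, NavigableSet<Edit> edits) {
        System.out.println(region + ": " + edits.size() + " edits");
    }
}

The point of the exercise: the splitter touches the WAL bytes once and
emits them already sorted. What the sketch glosses over is exactly the
spill path raised above.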
So, doing an accounting, I see little difference from what we have now. In
the new scheme:

+ We read all WALs as before.
+ We write about the same (in the current scheme we'd aggregate across WALs
so we didn't write a recovered-edits file per WAL), though the new scheme
may write less, since we currently flush after replaying recovered edits
and so nail into the filesystem an hfile holding those edits (but in the
new scheme we'll bring on a compaction, because we have references, which
will rewrite the big hfile into a smaller one...).
+ Metadata ops are about the same (instead of lots of small recovered-edits
files, we write lots of small reference files)... only the current scheme
does a distributed, parallelized sort and can spill if the data doesn't fit
in memory.

Am I doing the math right here? Is there a big improvement in MTTR? We are
offline while we sort and write the big hfile and its references. We might
save some because we just open the region after the above is done, where
now we open and then replay recovered edits (though we could take writes in
the current scheme with a bit of work).

Can we do better?

St.Ack

> The reason this will be on par with or better than DLR is that we are
> only doing 1 read and 1 write, and the sort is parallelized. The region
> opening does not have to block on replaying anything or waiting for a
> flush, because the data is already sorted and in HFile format. These
> hfiles will be used the normal way by adding them to the KVHeaps, etc.
> When compactions run, we will be removing the links to these files using
> the regular mechanisms.
>
> Enis
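
To make the link mechanism concrete, here is a hedged sketch of what
resolving one of these links at region open might look like; SectionLink,
its (file, offset, length) layout, and readSection() are hypothetical
illustrations in the spirit of the ReferenceFile concept, not HBase's
actual code.

import java.io.IOException;
import java.io.RandomAccessFile;

/**
 * Hypothetical link resolution at region open. A link is a tiny file in
 * the region directory recording (sharedFile, offset, length): the
 * region's own section inside the concatenated recovery file. Names and
 * layout are illustrative only.
 */
public class SectionLink {
    final String sharedFile; // the big multi-region file from log split
    final long offset;       // where this region's section starts
    final long length;       // section length in bytes

    SectionLink(String sharedFile, long offset, long length) {
        this.sharedFile = sharedFile;
        this.offset = offset;
        this.length = length;
    }

    /**
     * Read just this region's slice of the shared file. In the proposal,
     * a reader over this slice joins the store's KVHeap like any ordinary
     * hfile, so the region serves reads with no edit replay.
     */
    byte[] readSection() throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(sharedFile, "r")) {
            raf.seek(offset);
            byte[] section = new byte[(int) length];
            raf.readFully(section);
            return section;
        }
    }

    // At compaction time the region rewrites its slice into its own file
    // and drops this link; once no links remain, the shared file is dead.

    public static void main(String[] args) {
        // Hypothetical: region r1 owns bytes [0, 4096) of the shared file.
        SectionLink link = new SectionLink("recovery.sections", 0, 4096);
        System.out.println("r1 -> " + link.sharedFile + " @ " + link.offset);
    }
}

Compaction removing the last link is what lets the shared file be cleaned
up with the regular mechanisms Enis mentions.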
> On Tue, Oct 18, 2016 at 6:58 PM, Ted Yu wrote:
>
> > Allan:
> > One factor to consider is that the assignment manager in hbase 2.0
> > would be quite different from those in 0.98 and 1.x branches.
> >
> > Meaning, you may need to come up with two solutions for a single
> > problem.
> >
> > FYI
> >
> > On Tue, Oct 18, 2016 at 6:11 PM, Allan Yang wrote:
> >
> > > Hi, Ted
> > > These issues I mentioned above (HBASE-13567, HBASE-12743, HBASE-13535,
> > > HBASE-14729) are ALL reproduced in our HBase1.x test environment.
> > > Fixing them is exactly what I'm going to do. I haven't found the root
> > > causes yet, but I will update if I find solutions.
> > > What I'm afraid of is that there are other issues I don't know about
> > > yet. So if you or other guys know of other issues related to DLR,
> > > please let me know.
> > >
> > > Regards
> > > Allan Yang
> > >
> > > At 2016-10-19 00:19:06, "Ted Yu" wrote:
> > > >Allan:
> > > >I wonder how you deal with open issues such as HBASE-13535.
> > > >From your description, it seems your team fixed more DLR issues.
> > > >
> > > >Cheers
> > > >
> > > >On Mon, Oct 17, 2016 at 11:37 PM, allanwin wrote:
> > > >
> > >> Here is the thing. We have backported DLR (HBASE-7006) to our 0.94
> > >> clusters in the production environment (of course a lot of bugs were
> > >> fixed and it is working well). It was proven to be a huge gain. When
> > >> a large cluster crashes, the MTTR improved from several hours to less
> > >> than an hour. Now we want to move on to HBase1.x, and still we want
> > >> DLR. This time we don't want to backport the 'backported' DLR to
> > >> HBase1.x, but it seems the community has determined to remove DLR...
> > >>
> > >> The DLR feature is proven useful in our production environment, so I
> > >> think I will try to fix its issues in branch-1.x.
> > >>
> > >> At 2016-10-18 13:47:17, "Anoop John" wrote:
> > >> >Agree with your observation.. But the DLR feature we wanted to get
> > >> >removed.. Because it is known to have issues.. Or else we need major
> > >> >work to correct all these issues.
> > >> >
> > >> >-Anoop-
> > >> >
> > >> >On Tue, Oct 18, 2016 at 7:41 AM, Ted Yu wrote:
> > >> >> If you have a cluster, I suggest you turn on DLR and observe the
> > >> >> effect where fewer than half the region servers are up after the
> > >> >> crash. You would have first-hand experience that way.
> > >> >>
> > >> >> On Mon, Oct 17, 2016 at 6:33 PM, allanwin wrote:
> > >> >>
> > >> >>> Yes, region replica is a good way to improve MTTR. Especially if
> > >> >>> one or two servers are down, region replica can improve data
> > >> >>> availability. But for a big disaster like 1/3 or 1/2 of the
> > >> >>> region servers shutting down, I think DLR is still useful to
> > >> >>> bring regions online more quickly and with less IO usage.
> > >> >>>
> > >> >>> Regards
> > >> >>> Allan Yang
> > >> >>>
> > >> >>> At 2016-10-17 21:01:16, "Ted Yu" wrote:
> > >> >>> >Here was the thread discussing DLR:
> > >> >>> >http://search-hadoop.com/m/YGbbOxBK2n4ES12&subj=Re+DISCUSS+retiring+current+DLR+code
> > >> >>> >
> > >> >>> >> On Oct 17, 2016, at 4:15 AM, allanwin wrote:
> > >> >>> >>
> > >> >>> >> Hi, All
> > >> >>> >> DLR can improve MTTR dramatically, but since it has many bugs
> > >> >>> >> like HBASE-13567, HBASE-12743, HBASE-13535, HBASE-14729 (any
> > >> >>> >> more I don't know?), it was proved unreliable and has been
> > >> >>> >> deprecated in almost all branches now.
> > >> >>> >>
> > >> >>> >> My question is, is there any way other than DLR to improve
> > >> >>> >> MTTR? 'Cause if a big cluster crashes, it takes a long time to
> > >> >>> >> bring regions online, not to mention it will create huge
> > >> >>> >> pressure on the IOs.
> > >> >>> >>
> > >> >>> >> To tell the truth, I still want DLR back. If the community
> > >> >>> >> doesn't have any plan to bring DLR back, I may want to figure
> > >> >>> >> out the problems in DLR and make it work reliably. Any
> > >> >>> >> suggestions for that?
> > >> >>> >>
> > >> >>> >> Sincerely
> > >> >>> >> Allan Yang