From: Jonathan Gray <jlist@streamy.com>
To: hbase-dev@hadoop.apache.org
Date: Wed, 26 Aug 2009 09:04:36 -0700
Subject: Re: ANN: hbase 0.20.0 Release Candidate 2 available for download

I'm with Andrew. -1 on RC2.

I don't see the value in putting 0.20.0 into the wild when there are known defects, just to release 0.20.1 shortly thereafter saying "this fixes important issues, so please upgrade immediately." It's completely acceptable to say that lots of people are using RC2 in production and that it's fine to move forward with it... and then upgrade to the release when it is available.

Following the release of 0.20.0, we should all be on a PR kick: blogging, tweeting, emailing, reaching out, and talking to everyone we can about the awesome new release. So the initial release itself should be solid.

The balancing issue is a serious one: it means that if you lose a node and it comes back online, or if you add a new node, your cluster will suffer some serious reliability and performance issues. I don't think we should consider this rare or fringe; in fact, it means you can't do rolling restarts properly. (A rough sketch of the failure mode is below.)
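To make that concrete, here is a minimal, hypothetical sketch of the naive assignment behavior being described. This is NOT HBase's actual balancer code; the class and method names are invented for illustration. If new regions always go to the server carrying the fewest regions, and nothing rebalances existing regions, then a node that rejoins empty soaks up every new assignment:

import java.util.Comparator;
import java.util.List;

/**
 * Hypothetical illustration of the RC2 balancing problem; NOT actual
 * HBase code. New regions always go to the server with the fewest
 * regions, and existing regions are never rebalanced, so a node that
 * rejoins empty becomes the target of every new assignment.
 */
class NaiveAssignmentSketch {

    static class ServerLoad {
        final String name;
        int regionCount;
        ServerLoad(String name, int regionCount) {
            this.name = name;
            this.regionCount = regionCount;
        }
    }

    /** Always pick the least-loaded server. */
    static ServerLoad pickServerForNewRegion(List<ServerLoad> servers) {
        return servers.stream()
                .min(Comparator.comparingInt((ServerLoad s) -> s.regionCount))
                .orElseThrow(IllegalStateException::new);
    }

    public static void main(String[] args) {
        List<ServerLoad> cluster = List.of(
                new ServerLoad("rs1", 120),
                new ServerLoad("rs2", 118),
                new ServerLoad("rs3", 0)); // just restarted, came back empty

        // Every new region (e.g. from splits of recent, hot data) lands
        // on rs3 until it catches up to the others; this is the
        // recent-data-bias hotspot described above.
        for (int i = 0; i < 5; i++) {
            ServerLoad target = pickServerForNewRegion(cluster);
            target.regionCount++;
            System.out.println("new region -> " + target.name);
        }
    }
}

With a recent-data bias, those new regions are exactly the hottest ones, so one RS takes the bulk of reads and writes while the rest of the cluster sits comparatively idle.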
I experienced this in our running production system, and eventually I had to keep running the cluster w/o two of my nodes. If you have a node with far fewer regions than the others, then all new regions go to that regionserver... load becomes horribly unbalanced if you have a recent-data bias, with a majority of reads and writes going to a single node. This led to that RS being swamped w/ long GC pauses and generally bad cluster stability. It's a release blocker alone, IMO. JSharp ran into this yesterday, which is how we realized it had been uncommitted.

I *might* be okay with a release of 0.20.0 w/o a fix for HBASE-1784 because it is very rare... however, failed compactions leading to data loss is pretty nasty, and we should really try to fix it for release if we squash RC2 anyway. This is at least worth putting some effort into over the next few days to see if we can reproduce the issue and fix it (by rolling back failed compactions properly; see the sketch after this message). It's better that regions grow to huge sizes because compactions fail (and thus never split) than that we suffer complete data loss.

HBASE-1780 should be fixed and should not be too difficult, but it's maybe not a release blocker. For HBASE-1794, we'll have to hear from Ryan what its status is. No one wants to delay the release any longer, but the most important thing we can do is make sure the release is solid... We can't say that with these open issues.

Also, HDFS-200 testing by Ryan is turning up some great stuff, and he has had success (creating a table, kill -9ing the RS and DN, and META recovers fully and the table still exists... magic!). If we wait until Monday or so to cut RC3 (hopefully with fixes for much of the above), then perhaps by the time we're ready for release we can also have "official" but experimental support for HDFS-200. Ryan mentioned that if it works sufficiently well he'd like to put it into production at supr... and I feel the same here at streamy. If it generally works, we'll want to put it into production, as the current data loss story is really the only frightening thing left :)

JG
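P.S. For the HBASE-1784 fix, here is a minimal sketch, assuming nothing about HBase's actual store internals, of what "rolling back failed compactions properly" means in general. The class, paths, and the writeMergedFile() helper are all hypothetical; this is not the real HBase compaction code. The merged output goes to a temporary location first, and the inputs are deleted only after the output is safely in place, so a failure partway through leaves the original store files untouched:

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Hypothetical sketch of a crash-safe compaction; NOT the actual HBase
 * implementation. The only destructive step (deleting the input files)
 * happens after the merged output has been committed, never before.
 */
class CompactionRollbackSketch {

    static Path compact(FileSystem fs, Path storeDir, List<Path> inputs)
            throws IOException {
        // 1. Write the merged output under a tmp path, never over live data.
        Path tmp = new Path(storeDir, ".compaction.tmp/" + System.currentTimeMillis());
        try {
            writeMergedFile(fs, inputs, tmp);
        } catch (IOException e) {
            // 2. Rollback: discard the partial output. The inputs were
            //    never touched, so no rows are lost; the region just
            //    stays uncompacted until the next attempt.
            fs.delete(tmp, true);
            throw e;
        }

        // 3. Commit: move the finished file into place (a rename is a
        //    cheap metadata operation on HDFS)...
        Path committed = new Path(storeDir, "compacted." + System.currentTimeMillis());
        if (!fs.rename(tmp, committed)) {
            fs.delete(tmp, true);
            throw new IOException("could not commit compaction output " + tmp);
        }

        // 4. ...and only now is it safe to delete the inputs.
        for (Path input : inputs) {
            fs.delete(input, false);
        }
        return committed;
    }

    // Stand-in for the real merge logic (invented for this sketch);
    // here it just creates an empty output file.
    private static void writeMergedFile(FileSystem fs, List<Path> inputs, Path out)
            throws IOException {
        fs.create(out).close();
    }
}

With that ordering, the worst case for a failed compaction is exactly what I argue for above: regions that grow too large because compactions keep failing, rather than vanished data.

Andrew Purtell wrote:
> There is a lot riding on getting this release right. There have been some serious bugs unearthed since 0.20.0 RC1. This makes me nervous. I'm not sure I understand the rationale for releasing 0.20.0 now and then 0.20.1 in one week, as opposed to taking the same amount of time to run another RC cycle to produce a 0.20.0 without bad known defects. What is the benefit?
>
> HBASE-1794: Recovered data still seems missing until compaction, which might not happen for 24 hours. Seems like a fix is already known?
> HBASE-1780: Data loss, known fix.
> HBASE-1784: Data loss.
>
> I'll try to put up a patch/band-aid against at least one of these tonight.
>
> HBASE-1784 is really troubling. We should roll back a failed compaction, not vaporize data. -1 on those grounds alone.
>
> - Andy
>
> ________________________________
> From: stack
> To: hbase-dev@hadoop.apache.org
> Sent: Wednesday, August 26, 2009 4:21:33 PM
> Subject: Re: ANN: hbase 0.20.0 Release Candidate 2 available for download
>
> It will take a week or so to roll a new RC and to test and vote on it.
>
> Why not let out RC2 as 0.20.0 and do 0.20.1 within the next week or so?
>
> The balancing issue happens only when you bring a new node online. Usually
> balancing ain't bad.
>
> The Mathias issue is bad but still being investigated.
>
> Andrew?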
>
> St.Ack
>
> On Wed, Aug 26, 2009 at 1:04 AM, Mathias Herberts <mathias.herberts@gmail.com> wrote:
>
>> On Mon, Aug 24, 2009 at 16:51, Jean-Daniel Cryans wrote:
>>> +1 I ran it without any problem for a while. I asked Mathias if 1784
>>> should kill it and he thinks no since it is not deterministic.
>>
>> Given the latest run I did and the associated logs/investigation, which
>> clearly show that the missing rows are related to failed compactions, I
>> changed my mind and now think 1784 should kill this RC.
>>
>> So -1 for rc2.
>>
>> Mathias.