hbase-dev mailing list archives

From stack <st...@duboce.net>
Subject Re: ANN: hbase 0.20.0 Release Candidate 2 available for download
Date Wed, 26 Aug 2009 18:01:22 GMT
OK.  Let's sink the RC.  It's gotten too many -1s.  HBASE-1792/3 are bad too.

For the record, I'm +1 on RC2 becoming the release.  It's been running here at
pset on 110 nodes for the last week or so.  I downloaded it, checked out its
documentation, and started it up locally.


On Wed, Aug 26, 2009 at 9:04 AM, Jonathan Gray <jlist@streamy.com> wrote:

> I'm with Andrew.  -1 on RC2.
> I don't see the value in putting 0.20.0 into the wild when there are known
> defects, only to release 0.20.1 shortly thereafter saying: this fixes
> important issues, so please upgrade immediately.  It's completely acceptable
> to say, lots of people are using RC2 in production and it's fine to move
> forward with... and upgrade to the release when it is available.  Following
> release of 0.20.0, we should all be on a PR kick: blogging, tweeting,
> emailing, reaching out, and talking to everyone we can about the awesome new
> release.  So the initial release itself should be solid.

> The balancing issue is serious: if you lose a node and it comes back online,
> or if you add a new node, your cluster will suffer some serious reliability
> and performance problems.  I don't think we should consider this rare or
> fringe; in fact it means you can't do rolling restarts properly.
> I experienced this in our running production system and eventually I had to
> keep running the cluster w/o two of my nodes.  If you have a node with far
> fewer regions than the others, then all new regions go to that
> regionserver... load becomes horribly unbalanced if you have a
> recent-data-bias, with a majority of reads and writes going to a single
> node.  This led to that RS being swamped w/ long GC pauses and generally bad
> cluster stability.  It's a release blocker alone, IMO.  JSharp ran into this
> yesterday, which is how we realized it had been uncommitted.
> I *might* be okay with a release of 0.20.0 w/o a fix for HBASE-1784 because
> it is very rare... however failed compactions leading to data loss is pretty
> nasty and we should really try to fix it for release if we squash RC2
> anyway.  This is at least worth putting some effort into over the next few
> days to see if we can reproduce the issue and fix it (by rolling back failed
> compactions properly).  It's better that regions grow to huge sizes because
> compactions fail (and thus never split) than that we suffer complete data loss.
> HBASE-1780 should be fixed and should not be too difficult, but maybe not a
> release blocker.
> HBASE-1794 we'll have to hear from Ryan what the status is of it.
> No one wants to delay the release any longer, but the most important thing we
> can do is make sure the release is solid... We can't say that with these open
> issues.
> Also, HDFS-200 testing by Ryan is turning up some great stuff and he has
> had success (creating a table, kill -9ing the RS and DN, and seeing META
> recover fully with the table still intact... magic!).  If we wait until Monday or so
> to cut RC3 (hopefully with fixes for much of above), then perhaps by the
> time we're ready for release we can also have "official" but experimental
> support for HDFS-200.
> Ryan mentioned if it works sufficiently well he'd like to put it into
> production at supr... and I feel the same here at streamy.  If it generally
> works, we'll want to put it into production as the current data loss story
> is really the only frightening thing left :)
> JG
> Andrew Purtell wrote:
>> There is a lot riding on getting this release right. There have been some
>> serious bugs unearthed since 0.20.0 RC1. This makes me nervous. I'm not sure
>> I understand the rationale for releasing 0.20.0 now and then 0.20.1 in one
>> week, as opposed to taking the same amount of time to run another RC cycle
>> to produce a 0.20.0 without known bad defects. What is the benefit?
>>    HBASE-1794: Recovered data still seems missing until compaction, which
>> might not happen for 24 hours. Seems like a fix is already known?
>>    HBASE-1780: Data loss, known fix.
>>    HBASE-1784: Data loss.
>> I'll try to put up a patch/band-aid against at least one of these tonight.
>> HBASE-1784 is really troubling. We should roll back a failed compaction,
>> not vaporize data. -1 on those grounds alone.
>>    - Andy
>> ________________________________
>> From: stack <stack@duboce.net>
>> To: hbase-dev@hadoop.apache.org
>> Sent: Wednesday, August 26, 2009 4:21:33 PM
>> Subject: Re: ANN: hbase 0.20.0 Release Candidate 2 available for download
>> It will take a week or so to roll a new RC and to test and vote on it.
>> Why not let out RC2 as 0.20.0 and do 0.20.1 within the next week or so?
>> The balancing issue happens only when you bring a new node online.  Usually
>> balancing ain't bad.
>> The Mathias issue is bad but still being investigated.
>> Andrew?
>> St.Ack
>> On Wed, Aug 26, 2009 at 1:04 AM, Mathias Herberts <
>> mathias.herberts@gmail.com> wrote:
>>  On Mon, Aug 24, 2009 at 16:51, Jean-Daniel Cryans<jdcryans@apache.org>
>>> wrote:
>>>> +1 I ran it without any problem for a while. I asked Mathias if 1784
>>>> should kill it and he thinks no since it is not deterministic.
>>> Given the latest run I did and the associated logs/investigation, which
>>> clearly show that the missing rows are related to failed compactions, I've
>>> changed my mind and now think 1784 should kill this RC.
>>> So -1 for rc2.
>>> Mathias.
