hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jonathan Hsieh <...@cloudera.com>
Subject Re: ANNOUNCE: hbase 0.96.1.1rc0 release candidate is available for download.
Date Thu, 19 Dec 2013 19:45:02 GMT
Thanks Ted.

I'm looked at the history of that code and believe that this race and test
case only causes a perf problem (we may roll extra times) and has been
around since 0.90.5.

I'm +1'ing this for release also.


On Thu, Dec 19, 2013 at 9:45 AM, Ted Yu <yuzhihong@gmail.com> wrote:

> There used to be some comment around low replication checking:
>
>       // TODO: preserving the old behavior for now, but this check is
> strange. It's not      //       protected by any locks here, so for
> all we know rolling locks might start      //       as soon as we
> enter the "if". Is this best-effort optimization check?      if
> (!this.logRollRunning) {
>         checkLowReplication();
>
>
> This means that checkLowReplication() may be running when
> FSHLog#rollWriter() is also running - hence the race.
> That is why checkLowReplication() is now put under reentrant lock.
>
> w.r.t. rolling new RC, I would leave the decision making up to you.
>
> Low replication check has several parameters. One of which is the
> following:
>
>     this.lowReplicationRollLimit = conf.getInt(
>
>         "hbase.regionserver.hlog.lowreplication.rolllimit", 5);
>
> When consecutive log rolls exceeds the above conf value, lowReplicationRoll
> is disabled.
>
> If replica count goes down because of issue in Datanodes, the above limit
> would be reached ultimately.
>
> Cheers
>
> On Thu, Dec 19, 2013 at 9:21 AM, Jonathan Hsieh <jon@cloudera.com> wrote:
>
> > Ted,
> >
> > My question really boils down this this -- can we have a data loss if we
> > don't take in HBASE-10142 or not?  It will take me a little more time
> > reverse engineer and convince myself one way or another.    If we can
> lose
> > data, then I'll roll take the port and do another rc.  If we don't (we
> are
> > just less performant) then since we have 4 binding +1's other than me
> I'll
> > release.
> >
> > I either case, I want to understand how it fixes the problem. I can see
> the
> > changes e.g. the re-entrant lock, and the shifting out of
> > checkLowReplication call, but I need to figure out why the changes were
> > necessary. The main substantive change is that the lock is taken around
> the
> > checkLowReplication call but I haven't put together why this fixes the
> > problem.  Can you shed more light (and ideally lay out why the changes
> fix
> > the problem in jira? -- maybe a bad multithread trace that is fixed by
> the
> > new lock?)
> >
> > Thanks,
> > Jon.
> >
> >
> > On Thu, Dec 19, 2013 at 9:01 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> >
> > > Jon:
> > > If Stack gives the greenlight, I can certainly port it to 0.96 branch.
> > >
> > > Cheers
> > >
> > >
> > > On Thu, Dec 19, 2013 at 8:39 AM, Jonathan Hsieh <jon@cloudera.com>
> > wrote:
> > >
> > > > When I run the test standalone,  and didn't have an failure in
> > > 0.96.1.1rc0
> > > > or 0.96.1. When I ran the whole suite, I ran into exactly the same
> > > failure
> > > > on 0.96.1.1 (currnetly testing full suite from 0.96.1 src tar ball)
> > > >
> > > > I've spent some time reviewing HBASE-10142, There are some non-test
> > code
> > > > modifications still trying to determine if it is a serious problem or
> > not
> > > > on that side.   Ted, is there a reason why this wasn't ported to the
> > 0.96
> > > > branch?
> > > >
> > > > Jon.
> > > >
> > > >
> > > > On Wed, Dec 18, 2013 at 9:00 PM, Jean-Marc Spaggiari <
> > > > jean-marc@spaggiari.org> wrote:
> > > >
> > > > > Typo is because I have no done a cut&past ;)
> > > > >
> > > > > With cut&past: mvn test -PrunAllTests
> > > -Dsurefire.secondPartThreadCount=8
> > > > >
> > > > > On the last run, error is:
> > > > > Failed tests:
> > > > >
> > > > >
> > > >
> > >
> >
> testLogRollOnDatanodeDeath(org.apache.hadoop.hbase.regionserver.wal.TestLogRolling):
> > > > > LowReplication Roller should've been disabled, current
> replication=1
> > > > >
> > > > > I did not keep the errors from the previous runs...
> > > > >
> > > > >
> > > > > 2013/12/18 Ted Yu <yuzhihong@gmail.com>
> > > > >
> > > > > > What error(s) did you see ?
> > > > > >
> > > > > > There was a typo in this def ('s' between D and u):
> > > > > > -Dsurefire.secondPartThreadCount
> > > > > >
> > > > > > You can lower the thread count: in trunk build, value of 2 is
> used.
> > > > > >
> > > > > > Cheers
> > > > > >
> > > > > >
> > > > > > On Wed, Dec 18, 2013 at 8:45 PM, Jean-Marc Spaggiari <
> > > > > > jean-marc@spaggiari.org> wrote:
> > > > > >
> > > > > > > I tried multiple times over many hours to run:
> > > > > > > mvn test -PrunallTests -Dusefire.secondPartThreadCount=8
> > > > > > >
> > > > > > > On a local machine using the src jar, with no success.
I might
> be
> > > > > missing
> > > > > > > something... I will investigate so I will be able to provide
> > better
> > > > > > > feedback for 0.96.2...
> > > > > > >
> > > > > > > Sorry about that.
> > > > > > >
> > > > > > >
> > > > > > > 2013/12/18 Enis Söztutar <enis.soz@gmail.com>
> > > > > > >
> > > > > > > > +1.
> > > > > > > >
> > > > > > > > - downloaded the artifacts
> > > > > > > > - checked checksums
> > > > > > > > - checked sigs
> > > > > > > > - checked hadoop libs in h1 / h2
> > > > > > > > - checked directory layouts
> > > > > > > > - run local cluster
> > > > > > > > - run smoke tests with shell on the artifacts
> > > > > > > > - run tests locally:
> > > > > > > >   -- bin/hbase org.apache.hadoop.hbase.util.LoadTestTool
> -write
> > > > > > 10:10:100
> > > > > > > > -num_keys 1000000 -read 100:30
> > > > > > > >   -- bin/hbase
> > > > > > "org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList
> > > > > > > > Loop 1 1 3000000 /tmp/biglinkedlist 1"
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Dec 18, 2013 at 12:45 PM, Stack <stack@duboce.net>
> > > wrote:
> > > > > > > >
> > > > > > > > > +1
> > > > > > > > >
> > > > > > > > > Downloaded, unbundled, checked layout, and ran
it.  Browsed
> > the
> > > > mvn
> > > > > > > > > artifacts.
> > > > > > > > >
> > > > > > > > > St.Ack
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Dec 17, 2013 at 2:23 PM, Jonathan Hsieh
<
> > > > jon@cloudera.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > This is a quick-fix release directly off
of the 0.96.1
> > > release.
> > > > > > It
> > > > > > > > can
> > > > > > > > > be
> > > > > > > > > > downloaded here:
> > > > > > > > > >
> > > > > > > > > > http://people.apache.org/~jmhsieh/hbase-0.96.1.1rc0/
> > > > > > > > > >
> > > > > > > > > > The maven staging repo is here:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > >
> > > >
> https://repository.apache.org/content/repositories/orgapachehbase-056/
> > > > > > > > > >
> > > > > > > > > > There is only one jira'ed patch in this
release,
> > HBASE-10188
> > > > [1].
> > > > > > >  This
> > > > > > > > > > fixes an API incompatibility introduced
between 0.96.0
> and
> > > > > 0.96.1.
> > > > > > > > Other
> > > > > > > > > > changes include updates to CHANGES.txt (to
include that
> > > jira),
> > > > > and
> > > > > > > > > pom.xml
> > > > > > > > > > (naming the release 0.96.1.1).
> > > > > > > > > >
> > > > > > > > > > As such, we'll have an abridged testing
and voting
> period.
> > > > > Please
> > > > > > > > have
> > > > > > > > > a
> > > > > > > > > > quick look and vote +1/-1 by 12/18/13 23:59
pacific time.
> > If
> > > > this
> > > > > > > > passes
> > > > > > > > > > we'll take down 0.96.1.
> > > > > > > > > >
> > > > > > > > > > Please check the release mechanics -- this
is my first
> > > attempt
> > > > at
> > > > > > an
> > > > > > > > > hbase
> > > > > > > > > > release.
> > > > > > > > > >
> > > > > > > > > > Thanks!
> > > > > > > > > >
> > > > > > > > > > [1] https://issues.apache.org/jira/browse/HBASE-10188
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > // Jonathan Hsieh (shay)
> > > > > > > > > > // Software Engineer, Cloudera
> > > > > > > > > > // jon@cloudera.com
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > // Jonathan Hsieh (shay)
> > > > // Software Engineer, Cloudera
> > > > // jon@cloudera.com
> > > >
> > >
> >
> >
> >
> > --
> > // Jonathan Hsieh (shay)
> > // Software Engineer, Cloudera
> > // jon@cloudera.com
> >
>



-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message