hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ted Yu <yuzhih...@gmail.com>
Subject Re: ANNOUNCE: hbase 0.96.1.1rc0 release candidate is available for download.
Date Thu, 19 Dec 2013 17:45:14 GMT
There used to be some comment around low replication checking:

      // TODO: preserving the old behavior for now, but this check is
strange. It's not      //       protected by any locks here, so for
all we know rolling locks might start      //       as soon as we
enter the "if". Is this best-effort optimization check?      if
(!this.logRollRunning) {
        checkLowReplication();


This means that checkLowReplication() may be running when
FSHLog#rollWriter() is also running - hence the race.
That is why checkLowReplication() is now put under reentrant lock.

w.r.t. rolling new RC, I would leave the decision making up to you.

Low replication check has several parameters. One of which is the following:

    this.lowReplicationRollLimit = conf.getInt(

        "hbase.regionserver.hlog.lowreplication.rolllimit", 5);

When consecutive log rolls exceeds the above conf value, lowReplicationRoll
is disabled.

If replica count goes down because of issue in Datanodes, the above limit
would be reached ultimately.

Cheers

On Thu, Dec 19, 2013 at 9:21 AM, Jonathan Hsieh <jon@cloudera.com> wrote:

> Ted,
>
> My question really boils down this this -- can we have a data loss if we
> don't take in HBASE-10142 or not?  It will take me a little more time
> reverse engineer and convince myself one way or another.    If we can lose
> data, then I'll roll take the port and do another rc.  If we don't (we are
> just less performant) then since we have 4 binding +1's other than me I'll
> release.
>
> I either case, I want to understand how it fixes the problem. I can see the
> changes e.g. the re-entrant lock, and the shifting out of
> checkLowReplication call, but I need to figure out why the changes were
> necessary. The main substantive change is that the lock is taken around the
> checkLowReplication call but I haven't put together why this fixes the
> problem.  Can you shed more light (and ideally lay out why the changes fix
> the problem in jira? -- maybe a bad multithread trace that is fixed by the
> new lock?)
>
> Thanks,
> Jon.
>
>
> On Thu, Dec 19, 2013 at 9:01 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>
> > Jon:
> > If Stack gives the greenlight, I can certainly port it to 0.96 branch.
> >
> > Cheers
> >
> >
> > On Thu, Dec 19, 2013 at 8:39 AM, Jonathan Hsieh <jon@cloudera.com>
> wrote:
> >
> > > When I run the test standalone,  and didn't have an failure in
> > 0.96.1.1rc0
> > > or 0.96.1. When I ran the whole suite, I ran into exactly the same
> > failure
> > > on 0.96.1.1 (currnetly testing full suite from 0.96.1 src tar ball)
> > >
> > > I've spent some time reviewing HBASE-10142, There are some non-test
> code
> > > modifications still trying to determine if it is a serious problem or
> not
> > > on that side.   Ted, is there a reason why this wasn't ported to the
> 0.96
> > > branch?
> > >
> > > Jon.
> > >
> > >
> > > On Wed, Dec 18, 2013 at 9:00 PM, Jean-Marc Spaggiari <
> > > jean-marc@spaggiari.org> wrote:
> > >
> > > > Typo is because I have no done a cut&past ;)
> > > >
> > > > With cut&past: mvn test -PrunAllTests
> > -Dsurefire.secondPartThreadCount=8
> > > >
> > > > On the last run, error is:
> > > > Failed tests:
> > > >
> > > >
> > >
> >
> testLogRollOnDatanodeDeath(org.apache.hadoop.hbase.regionserver.wal.TestLogRolling):
> > > > LowReplication Roller should've been disabled, current replication=1
> > > >
> > > > I did not keep the errors from the previous runs...
> > > >
> > > >
> > > > 2013/12/18 Ted Yu <yuzhihong@gmail.com>
> > > >
> > > > > What error(s) did you see ?
> > > > >
> > > > > There was a typo in this def ('s' between D and u):
> > > > > -Dsurefire.secondPartThreadCount
> > > > >
> > > > > You can lower the thread count: in trunk build, value of 2 is used.
> > > > >
> > > > > Cheers
> > > > >
> > > > >
> > > > > On Wed, Dec 18, 2013 at 8:45 PM, Jean-Marc Spaggiari <
> > > > > jean-marc@spaggiari.org> wrote:
> > > > >
> > > > > > I tried multiple times over many hours to run:
> > > > > > mvn test -PrunallTests -Dusefire.secondPartThreadCount=8
> > > > > >
> > > > > > On a local machine using the src jar, with no success. I might
be
> > > > missing
> > > > > > something... I will investigate so I will be able to provide
> better
> > > > > > feedback for 0.96.2...
> > > > > >
> > > > > > Sorry about that.
> > > > > >
> > > > > >
> > > > > > 2013/12/18 Enis Söztutar <enis.soz@gmail.com>
> > > > > >
> > > > > > > +1.
> > > > > > >
> > > > > > > - downloaded the artifacts
> > > > > > > - checked checksums
> > > > > > > - checked sigs
> > > > > > > - checked hadoop libs in h1 / h2
> > > > > > > - checked directory layouts
> > > > > > > - run local cluster
> > > > > > > - run smoke tests with shell on the artifacts
> > > > > > > - run tests locally:
> > > > > > >   -- bin/hbase org.apache.hadoop.hbase.util.LoadTestTool
-write
> > > > > 10:10:100
> > > > > > > -num_keys 1000000 -read 100:30
> > > > > > >   -- bin/hbase
> > > > > "org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList
> > > > > > > Loop 1 1 3000000 /tmp/biglinkedlist 1"
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Dec 18, 2013 at 12:45 PM, Stack <stack@duboce.net>
> > wrote:
> > > > > > >
> > > > > > > > +1
> > > > > > > >
> > > > > > > > Downloaded, unbundled, checked layout, and ran it.
 Browsed
> the
> > > mvn
> > > > > > > > artifacts.
> > > > > > > >
> > > > > > > > St.Ack
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Dec 17, 2013 at 2:23 PM, Jonathan Hsieh <
> > > jon@cloudera.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > This is a quick-fix release directly off of the
0.96.1
> > release.
> > > > > It
> > > > > > > can
> > > > > > > > be
> > > > > > > > > downloaded here:
> > > > > > > > >
> > > > > > > > > http://people.apache.org/~jmhsieh/hbase-0.96.1.1rc0/
> > > > > > > > >
> > > > > > > > > The maven staging repo is here:
> > > > > > > > >
> > > > > > > > >
> > > > > >
> > > https://repository.apache.org/content/repositories/orgapachehbase-056/
> > > > > > > > >
> > > > > > > > > There is only one jira'ed patch in this release,
> HBASE-10188
> > > [1].
> > > > > >  This
> > > > > > > > > fixes an API incompatibility introduced between
0.96.0 and
> > > > 0.96.1.
> > > > > > > Other
> > > > > > > > > changes include updates to CHANGES.txt (to include
that
> > jira),
> > > > and
> > > > > > > > pom.xml
> > > > > > > > > (naming the release 0.96.1.1).
> > > > > > > > >
> > > > > > > > > As such, we'll have an abridged testing and voting
period.
> > > > Please
> > > > > > > have
> > > > > > > > a
> > > > > > > > > quick look and vote +1/-1 by 12/18/13 23:59 pacific
time.
> If
> > > this
> > > > > > > passes
> > > > > > > > > we'll take down 0.96.1.
> > > > > > > > >
> > > > > > > > > Please check the release mechanics -- this is
my first
> > attempt
> > > at
> > > > > an
> > > > > > > > hbase
> > > > > > > > > release.
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > >
> > > > > > > > > [1] https://issues.apache.org/jira/browse/HBASE-10188
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > // Jonathan Hsieh (shay)
> > > > > > > > > // Software Engineer, Cloudera
> > > > > > > > > // jon@cloudera.com
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > // Jonathan Hsieh (shay)
> > > // Software Engineer, Cloudera
> > > // jon@cloudera.com
> > >
> >
>
>
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // jon@cloudera.com
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message