hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ryan Rawson <ryano...@gmail.com>
Subject Re: initial experience with HBase 0.90.1 rc0
Date Sun, 13 Feb 2011 20:24:20 GMT
every handler thread, and every reader and also the accept thread are
all blocked on flushing memstore.  The handlers get blocked, then the
readers also have a finite handoff queue and they are blocked and also
the accept.

But why isnt memstore flushing?  Do you have regionserver stats?  ie:
how much memstore global ram used?  That is found on the main page of
the regionserver http service, also found in ganglia/file stats.

I havent looked at the logs yet, I'm off to lunch now.

-ryan

On Sun, Feb 13, 2011 at 8:44 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> I had 3 consecutive successful runs processing 200GB data for each run
> before hitting timeout problem in the 4th run.
>
> The 5th run couldn't proceed because master complained:
>
> 2011-02-13 16:11:45,173 FATAL org.apache.hadoop.hbase.master.HMaster: Failed
> assignment of regions to
> serverName=sjc1-hadoop6.sjc1.carrieriq.com,60020,1297518996557,
> load=(requests=0, regions=231, usedHeap=3535, maxHeap=3983)
>
> but sjc1-hadoop6.sjc1 claimed:
> 2011-02-13 16:13:32,258 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegionServer: No master found, will
> retry
>
> Here is stack trace for sjc1-hadoop6.sjc1:
> http://pastebin.com/X8zWLXqu
>
> I didn't have chance to capture master stack trace as master exited after
> that.
>
> I also attach master and region server log on sjc1-hadoop6.sjc1 - pardon me
> for including individual email addresses as attachments wouldn't go through
> hbase.apache.org
>
> On Thu, Feb 10, 2011 at 5:05 PM, Todd Lipcon <todd@cloudera.com> wrote:
>>
>> On Thu, Feb 10, 2011 at 4:54 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>>
>> > Thanks for the explanation.
>> > Assuming the mixed class loading is static, why did this situation
>> > develop
>> > after 40 minutes of heavy load :-(
>> >
>>
>> You didn't hit global memstore pressure until 40 minutes of load.
>>
>> -Todd
>>
>> On Thu, Feb 10, 2011 at 4:42 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
>> >
>> > > It's a standard linking issue, you get one class from one version
>> > > another from another, they are mostly compatible in terms of
>> > > signatures (hence no exceptions) but are subtly incompatible in
>> > > different ways. In the stack trace you posted, the handlers were
>> > > blocked in:
>> > >
>> > >        at
>> > >
>> >
>> > org.apache.hadoop.hbase.regionserver.MemStoreFlusher.reclaimMemStoreMemory(MemStoreFlusher.java:382)
>> > >
>> > > and the thread:
>> > >
>> > > "regionserver60020.cacheFlusher" daemon prio=10 tid=0x00002aaabc21e000
>> > > nid=0x7717 waiting for monitor entry [0x0000000000000000]
>> > >   java.lang.Thread.State: BLOCKED (on object monitor)
>> > >
>> > > was idle.
>> > >
>> > > The cache flusher thread should be flushing, and yet it's doing
>> > > nothing.  This also happens to be one of the classes that were
>> > > changed.
>> > >
>> > >
>> > >
>> > > On Thu, Feb 10, 2011 at 4:34 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>> > > > Can someone comment on my second question ?
>> > > > Thanks
>> > > >
>> > > > On Thu, Feb 10, 2011 at 4:25 PM, Ryan Rawson <ryanobjc@gmail.com>
>> > wrote:
>> > > >
>> > > >> As I suspected.
>> > > >>
>> > > >> It's a byproduct of our maven assembly process. The process could
>> > > >> be
>> > > >> fixed. I wouldn't mind. I don't support runtime checking of jars,
>> > > >> there is such thing as too much tests, and this is an example
of
>> > > >> it.
>> > > >> The check would then need a test, etc, etc.
>> > > >>
>> > > >> At SU we use new directories for each upgrade, copying the config
>> > > >> over. With the lack of -default.xml this is easier than ever (just
>> > > >> copy everything in conf/).  With symlink switchover it makes
roll
>> > > >> forward/back as simple as doing a symlink switchover or back.
I
>> > > >> have
>> > > >> to recommend this to everyone who doesnt have a management scheme.
>> > > >>
>> > > >> On Thu, Feb 10, 2011 at 4:20 PM, Ted Yu <yuzhihong@gmail.com>
>> > > >> wrote:
>> > > >> > hbase/hbase-0.90.1.jar leads lib/hbase-0.90.0.jar in the
>> > > >> > classpath.
>> > > >> > I wonder
>> > > >> > 1. why hbase jar is placed in two directories - 0.20.6 didn't
use
>> > such
>> > > >> > structure
>> > > >> > 2. what from lib/hbase-0.90.0.jar could be picked up and
why
>> > > >> > there
>> > > wasn't
>> > > >> > exception in server log
>> > > >> >
>> > > >> > I think a JIRA should be filed for item 2 above - bail out
when
>> > > >> > the
>> > > two
>> > > >> > hbase jars from $HBASE_HOME and $HBASE_HOME/lib are of different
>> > > >> versions.
>> > > >> >
>> > > >> > Cheers
>> > > >> >
>> > > >> > On Thu, Feb 10, 2011 at 3:40 PM, Ryan Rawson <ryanobjc@gmail.com>
>> > > wrote:
>> > > >> >
>> > > >> >> What do you get when you:
>> > > >> >>
>> > > >> >> ls lib/hbase*
>> > > >> >>
>> > > >> >> I'm going to guess there is hbase-0.90.0.jar there
>> > > >> >>
>> > > >> >>
>> > > >> >>
>> > > >> >> On Thu, Feb 10, 2011 at 3:25 PM, Ted Yu <yuzhihong@gmail.com>
>> > wrote:
>> > > >> >> > hbase-0.90.0-tests.jar and hbase-0.90.1.jar co-exist
>> > > >> >> > Would this be a problem ?
>> > > >> >> >
>> > > >> >> > On Thu, Feb 10, 2011 at 3:16 PM, Ryan Rawson
>> > > >> >> > <ryanobjc@gmail.com
>> > >
>> > > >> wrote:
>> > > >> >> >
>> > > >> >> >> You don't have both the old and the new hbase
jars in there
>> > > >> >> >> do
>> > > you?
>> > > >> >> >>
>> > > >> >> >> -ryan
>> > > >> >> >>
>> > > >> >> >> On Thu, Feb 10, 2011 at 3:12 PM, Ted Yu <yuzhihong@gmail.com>
>> > > wrote:
>> > > >> >> >> > .META. went offline during second flow
attempt.
>> > > >> >> >> >
>> > > >> >> >> > The time out I mentioned happened for 1st
and 3rd attempts.
>> > > HBase
>> > > >> was
>> > > >> >> >> > restarted before the 1st and 3rd attempts.
>> > > >> >> >> >
>> > > >> >> >> > Here is jstack:
>> > > >> >> >> > http://pastebin.com/EHMSvsRt
>> > > >> >> >> >
>> > > >> >> >> > On Thu, Feb 10, 2011 at 3:04 PM, Stack
<stack@duboce.net>
>> > > wrote:
>> > > >> >> >> >
>> > > >> >> >> >> So, .META. is not online?  What happens
if you use shell
>> > > >> >> >> >> at
>> > > this
>> > > >> >> time.
>> > > >> >> >> >>
>> > > >> >> >> >> Your attachement did not come across
Ted.  Mind
>> > > >> >> >> >> postbin'ing
>> > it?
>> > > >> >> >> >>
>> > > >> >> >> >> St.Ack
>> > > >> >> >> >>
>> > > >> >> >> >> On Thu, Feb 10, 2011 at 2:41 PM, Ted
Yu
>> > > >> >> >> >> <yuzhihong@gmail.com
>> > >
>> > > >> wrote:
>> > > >> >> >> >> > I replaced hbase jar with hbase-0.90.1.jar
>> > > >> >> >> >> > I also upgraded client side jar
to hbase-0.90.1.jar
>> > > >> >> >> >> >
>> > > >> >> >> >> > Our map tasks were running faster
than before for about
>> > > >> >> >> >> > 50
>> > > >> minutes.
>> > > >> >> >> >> However,
>> > > >> >> >> >> > map tasks then timed out calling
flushCommits(). This
>> > > happened
>> > > >> even
>> > > >> >> >> after
>> > > >> >> >> >> > fresh restart of hbase.
>> > > >> >> >> >> >
>> > > >> >> >> >> > I don't see any exception in region
server logs.
>> > > >> >> >> >> >
>> > > >> >> >> >> > In master log, I found:
>> > > >> >> >> >> >
>> > > >> >> >> >> > 2011-02-10 18:24:15,286 DEBUG
>> > > >> >> >> >> >
>> > > >> >> >> >> > org.apache.hadoop.hbase.master.handler.OpenedRegionHandler:
>> > > >> Opened
>> > > >> >> >> region
>> > > >> >> >> >> > -ROOT-,,0.70236052 on sjc1-hadoop6.X.com
>> > ,60020,1297362251595
>> > > >> >> >> >> > 2011-02-10 18:24:15,349 INFO
>> > > >> >> >> >> org.apache.hadoop.hbase.catalog.CatalogTracker:
>> > > >> >> >> >> > Failed verification of .META.,,1
at address=null;
>> > > >> >> >> >> > org.apache.hadoop.hbase.NotServingRegionException:
>> > > >> >> >> >> > org.apache.hadoop.hbase.NotServingRegionException:
>> > > >> >> >> >> > Region
>> > is
>> > > not
>> > > >> >> >> online:
>> > > >> >> >> >> > .META.,,1
>> > > >> >> >> >> > 2011-02-10 18:24:15,350 DEBUG
>> > > >> >> >> org.apache.hadoop.hbase.zookeeper.ZKAssign:
>> > > >> >> >> >> > master:60000-0x12e10d0e31e0000
Creating (or updating)
>> > > unassigned
>> > > >> >> node
>> > > >> >> >> for
>> > > >> >> >> >> > 1028785192 with OFFLINE state
>> > > >> >> >> >> >
>> > > >> >> >> >> > I am attaching region server (which
didn't respond to
>> > > >> >> stop-hbase.sh)
>> > > >> >> >> >> jstack.
>> > > >> >> >> >> >
>> > > >> >> >> >> > FYI
>> > > >> >> >> >> >
>> > > >> >> >> >> > On Thu, Feb 10, 2011 at 10:10
AM, Stack
>> > > >> >> >> >> > <stack@duboce.net>
>> > > >> wrote:
>> > > >> >> >> >> >>
>> > > >> >> >> >> >> Thats probably enough Ted.
 The 0.90.1
>> > > >> >> >> >> >> hbase-default.xml
>> > has
>> > > an
>> > > >> >> extra
>> > > >> >> >> >> >> config. to enable the experimental
HBASE-3455 feature
>> > > >> >> >> >> >> but
>> > > you
>> > > >> can
>> > > >> >> >> copy
>> > > >> >> >> >> >> that over if you want to try
playing with it (it
>> > > >> >> >> >> >> defaults
>> > > off
>> > > >> so
>> > > >> >> >> you'd
>> > > >> >> >> >> >> copy over the config. if you
wanted to set it to true).
>> > > >> >> >> >> >>
>> > > >> >> >> >> >> St.Ack
>> > > >> >> >> >> >
>> > > >> >> >> >> >
>> > > >> >> >> >>
>> > > >> >> >> >
>> > > >> >> >>
>> > > >> >> >
>> > > >> >>
>> > > >> >
>> > > >>
>> > > >
>> > >
>> >
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>
>

Mime
View raw message