hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Todd Lipcon <t...@cloudera.com>
Subject Re: MemStoreFlusher and my initial experience with HBase 0.90.1 rc0
Date Mon, 14 Feb 2011 20:59:34 GMT
Hi Ted,

What was the reason for the region server shutdown? ie was it aborting
itself, or you send it a kill signal, or what?

Still trying to understand why this happened.

-Todd

On Mon, Feb 14, 2011 at 10:40 AM, Ted Yu <yuzhihong@gmail.com> wrote:

> Here is stack trace for one region server which didn't cleanly shutdown:
> http://pastebin.com/PEtEdi4g
>
> I noticed IPC Server handler 7 was holding lock in reclaimMemStoreMemory().
>
> Here is related snippet from the region server's log:
> http://pastebin.com/KeXppURX
>
> I noticed region 1297662719973.cd06c5f9e4a0ffcc16cb6a5e559cd5b3 was
> splitting.
>
> Please advise whether there could be any relation between the above snippet
> and lock up in MemStoreFlusher.
>
> Thanks
>
> On Mon, Feb 14, 2011 at 9:20 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>
> > I disabled MSLAB.
> > My flow still couldn't make much progress.
> >
> >
> >> In this region server stack trace, I don't see
> >> MemStoreFlusher.reclaimMemStoreMemory() call:
> >> http://pastebin.com/uiBRidUa
> >>
> >>
> >> On Sun, Feb 13, 2011 at 1:14 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >>
> >>> I am using hadoop-core-0.20.2-322.jar downloaded from Ryan's repo.
> >>> FYI
> >>>
> >>>
> >>> On Sun, Feb 13, 2011 at 1:12 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >>>
> >>>> Since master server shut down, I restarted the cluster.
> >>>> The next flow over 200GB data got timed out.
> >>>>
> >>>> Here are some region server stat:
> >>>>
> >>>> request=0.0, regions=95, stores=213, storefiles=65,
> >>>> storefileIndexSize=99, memstoreSize=1311, compactionQueueSize=0,
> >>>> flushQueueSize=0, usedHeap=2532, maxHeap=3983, blockCacheSize=6853968,
> >>>> blockCacheFree=828520304, blockCacheCount=0, blockCacheHitCount=0,
> >>>> blockCacheMissCount=0, blockCacheEvictedCount=0, blockCacheHitRatio=0,
> >>>> blockCacheHitCachingRatio=0
> >>>>
> >>>> request=0.0, regions=95, stores=232, storefiles=72,
> >>>> storefileIndexSize=120, memstoreSize=301, compactionQueueSize=0,
> >>>> flushQueueSize=0, usedHeap=1740, maxHeap=3983,
> blockCacheSize=13110928,
> >>>> blockCacheFree=822263344, blockCacheCount=712,
> blockCacheHitCount=112478,
> >>>> blockCacheMissCount=712, blockCacheEvictedCount=0,
> blockCacheHitRatio=99,
> >>>> blockCacheHitCachingRatio=99
> >>>>
> >>>> Thanks
> >>>>
> >>>>
> >>>> On Sun, Feb 13, 2011 at 12:24 PM, Ryan Rawson <ryanobjc@gmail.com
> >wrote:
> >>>>
> >>>>> every handler thread, and every reader and also the accept thread
are
> >>>>> all blocked on flushing memstore.  The handlers get blocked, then
the
> >>>>> readers also have a finite handoff queue and they are blocked and
> also
> >>>>> the accept.
> >>>>>
> >>>>> But why isnt memstore flushing?  Do you have regionserver stats?
 ie:
> >>>>> how much memstore global ram used?  That is found on the main page
of
> >>>>> the regionserver http service, also found in ganglia/file stats.
> >>>>>
> >>>>> I havent looked at the logs yet, I'm off to lunch now.
> >>>>>
> >>>>> -ryan
> >>>>>
> >>>>> On Sun, Feb 13, 2011 at 8:44 AM, Ted Yu <yuzhihong@gmail.com>
wrote:
> >>>>> > I had 3 consecutive successful runs processing 200GB data for
each
> >>>>> run
> >>>>> > before hitting timeout problem in the 4th run.
> >>>>> >
> >>>>> > The 5th run couldn't proceed because master complained:
> >>>>> >
> >>>>> > 2011-02-13 16:11:45,173 FATAL
> org.apache.hadoop.hbase.master.HMaster:
> >>>>> Failed
> >>>>> > assignment of regions to
> >>>>> > serverName=sjc1-hadoop6.sjc1.carrieriq.com,60020,1297518996557,
> >>>>> > load=(requests=0, regions=231, usedHeap=3535, maxHeap=3983)
> >>>>> >
> >>>>> > but sjc1-hadoop6.sjc1 claimed:
> >>>>> > 2011-02-13 16:13:32,258 DEBUG
> >>>>> > org.apache.hadoop.hbase.regionserver.HRegionServer: No master
> found,
> >>>>> will
> >>>>> > retry
> >>>>> >
> >>>>> > Here is stack trace for sjc1-hadoop6.sjc1:
> >>>>> > http://pastebin.com/X8zWLXqu
> >>>>> >
> >>>>> > I didn't have chance to capture master stack trace as master
exited
> >>>>> after
> >>>>> > that.
> >>>>> >
> >>>>> > I also attach master and region server log on sjc1-hadoop6.sjc1
-
> >>>>> pardon me
> >>>>> > for including individual email addresses as attachments wouldn't
go
> >>>>> through
> >>>>> > hbase.apache.org
> >>>>> >
> >>>>> > On Thu, Feb 10, 2011 at 5:05 PM, Todd Lipcon <todd@cloudera.com>
> >>>>> wrote:
> >>>>> >>
> >>>>> >> On Thu, Feb 10, 2011 at 4:54 PM, Ted Yu <yuzhihong@gmail.com>
> >>>>> wrote:
> >>>>> >>
> >>>>> >> > Thanks for the explanation.
> >>>>> >> > Assuming the mixed class loading is static, why did
this
> situation
> >>>>> >> > develop
> >>>>> >> > after 40 minutes of heavy load :-(
> >>>>> >> >
> >>>>> >>
> >>>>> >> You didn't hit global memstore pressure until 40 minutes
of load.
> >>>>> >>
> >>>>> >> -Todd
> >>>>> >>
> >>>>> >> On Thu, Feb 10, 2011 at 4:42 PM, Ryan Rawson <ryanobjc@gmail.com>
> >>>>> wrote:
> >>>>> >> >
> >>>>> >> > > It's a standard linking issue, you get one class
from one
> >>>>> version
> >>>>> >> > > another from another, they are mostly compatible
in terms of
> >>>>> >> > > signatures (hence no exceptions) but are subtly
incompatible
> in
> >>>>> >> > > different ways. In the stack trace you posted,
the handlers
> were
> >>>>> >> > > blocked in:
> >>>>> >> > >
> >>>>> >> > >        at
> >>>>> >> > >
> >>>>> >> >
> >>>>> >> >
> >>>>>
> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.reclaimMemStoreMemory(MemStoreFlusher.java:382)
> >>>>> >> > >
> >>>>> >> > > and the thread:
> >>>>> >> > >
> >>>>> >> > > "regionserver60020.cacheFlusher" daemon prio=10
> >>>>> tid=0x00002aaabc21e000
> >>>>> >> > > nid=0x7717 waiting for monitor entry [0x0000000000000000]
> >>>>> >> > >   java.lang.Thread.State: BLOCKED (on object
monitor)
> >>>>> >> > >
> >>>>> >> > > was idle.
> >>>>> >> > >
> >>>>> >> > > The cache flusher thread should be flushing,
and yet it's
> doing
> >>>>> >> > > nothing.  This also happens to be one of the
classes that were
> >>>>> >> > > changed.
> >>>>> >> > >
> >>>>> >> > >
> >>>>> >> > >
> >>>>> >> > > On Thu, Feb 10, 2011 at 4:34 PM, Ted Yu <yuzhihong@gmail.com>
> >>>>> wrote:
> >>>>> >> > > > Can someone comment on my second question
?
> >>>>> >> > > > Thanks
> >>>>> >> > > >
> >>>>> >> > > > On Thu, Feb 10, 2011 at 4:25 PM, Ryan Rawson
<
> >>>>> ryanobjc@gmail.com>
> >>>>> >> > wrote:
> >>>>> >> > > >
> >>>>> >> > > >> As I suspected.
> >>>>> >> > > >>
> >>>>> >> > > >> It's a byproduct of our maven assembly
process. The process
> >>>>> could
> >>>>> >> > > >> be
> >>>>> >> > > >> fixed. I wouldn't mind. I don't support
runtime checking of
> >>>>> jars,
> >>>>> >> > > >> there is such thing as too much tests,
and this is an
> example
> >>>>> of
> >>>>> >> > > >> it.
> >>>>> >> > > >> The check would then need a test, etc,
etc.
> >>>>> >> > > >>
> >>>>> >> > > >> At SU we use new directories for each
upgrade, copying the
> >>>>> config
> >>>>> >> > > >> over. With the lack of -default.xml
this is easier than
> ever
> >>>>> (just
> >>>>> >> > > >> copy everything in conf/).  With symlink
switchover it
> makes
> >>>>> roll
> >>>>> >> > > >> forward/back as simple as doing a symlink
switchover or
> back.
> >>>>> I
> >>>>> >> > > >> have
> >>>>> >> > > >> to recommend this to everyone who doesnt
have a management
> >>>>> scheme.
> >>>>> >> > > >>
> >>>>> >> > > >> On Thu, Feb 10, 2011 at 4:20 PM, Ted
Yu <
> yuzhihong@gmail.com
> >>>>> >
> >>>>> >> > > >> wrote:
> >>>>> >> > > >> > hbase/hbase-0.90.1.jar leads lib/hbase-0.90.0.jar
in the
> >>>>> >> > > >> > classpath.
> >>>>> >> > > >> > I wonder
> >>>>> >> > > >> > 1. why hbase jar is placed in two
directories - 0.20.6
> >>>>> didn't use
> >>>>> >> > such
> >>>>> >> > > >> > structure
> >>>>> >> > > >> > 2. what from lib/hbase-0.90.0.jar
could be picked up and
> >>>>> why
> >>>>> >> > > >> > there
> >>>>> >> > > wasn't
> >>>>> >> > > >> > exception in server log
> >>>>> >> > > >> >
> >>>>> >> > > >> > I think a JIRA should be filed
for item 2 above - bail
> out
> >>>>> when
> >>>>> >> > > >> > the
> >>>>> >> > > two
> >>>>> >> > > >> > hbase jars from $HBASE_HOME and
$HBASE_HOME/lib are of
> >>>>> different
> >>>>> >> > > >> versions.
> >>>>> >> > > >> >
> >>>>> >> > > >> > Cheers
> >>>>> >> > > >> >
> >>>>> >> > > >> > On Thu, Feb 10, 2011 at 3:40 PM,
Ryan Rawson <
> >>>>> ryanobjc@gmail.com>
> >>>>> >> > > wrote:
> >>>>> >> > > >> >
> >>>>> >> > > >> >> What do you get when you:
> >>>>> >> > > >> >>
> >>>>> >> > > >> >> ls lib/hbase*
> >>>>> >> > > >> >>
> >>>>> >> > > >> >> I'm going to guess there is
hbase-0.90.0.jar there
> >>>>> >> > > >> >>
> >>>>> >> > > >> >>
> >>>>> >> > > >> >>
> >>>>> >> > > >> >> On Thu, Feb 10, 2011 at 3:25
PM, Ted Yu <
> >>>>> yuzhihong@gmail.com>
> >>>>> >> > wrote:
> >>>>> >> > > >> >> > hbase-0.90.0-tests.jar
and hbase-0.90.1.jar co-exist
> >>>>> >> > > >> >> > Would this be a problem
?
> >>>>> >> > > >> >> >
> >>>>> >> > > >> >> > On Thu, Feb 10, 2011 at
3:16 PM, Ryan Rawson
> >>>>> >> > > >> >> > <ryanobjc@gmail.com
> >>>>> >> > >
> >>>>> >> > > >> wrote:
> >>>>> >> > > >> >> >
> >>>>> >> > > >> >> >> You don't have both
the old and the new hbase jars in
> >>>>> there
> >>>>> >> > > >> >> >> do
> >>>>> >> > > you?
> >>>>> >> > > >> >> >>
> >>>>> >> > > >> >> >> -ryan
> >>>>> >> > > >> >> >>
> >>>>> >> > > >> >> >> On Thu, Feb 10, 2011
at 3:12 PM, Ted Yu <
> >>>>> yuzhihong@gmail.com>
> >>>>> >> > > wrote:
> >>>>> >> > > >> >> >> > .META. went offline
during second flow attempt.
> >>>>> >> > > >> >> >> >
> >>>>> >> > > >> >> >> > The time out
I mentioned happened for 1st and 3rd
> >>>>> attempts.
> >>>>> >> > > HBase
> >>>>> >> > > >> was
> >>>>> >> > > >> >> >> > restarted before
the 1st and 3rd attempts.
> >>>>> >> > > >> >> >> >
> >>>>> >> > > >> >> >> > Here is jstack:
> >>>>> >> > > >> >> >> > http://pastebin.com/EHMSvsRt
> >>>>> >> > > >> >> >> >
> >>>>> >> > > >> >> >> > On Thu, Feb 10,
2011 at 3:04 PM, Stack <
> >>>>> stack@duboce.net>
> >>>>> >> > > wrote:
> >>>>> >> > > >> >> >> >
> >>>>> >> > > >> >> >> >> So, .META.
is not online?  What happens if you use
> >>>>> shell
> >>>>> >> > > >> >> >> >> at
> >>>>> >> > > this
> >>>>> >> > > >> >> time.
> >>>>> >> > > >> >> >> >>
> >>>>> >> > > >> >> >> >> Your attachement
did not come across Ted.  Mind
> >>>>> >> > > >> >> >> >> postbin'ing
> >>>>> >> > it?
> >>>>> >> > > >> >> >> >>
> >>>>> >> > > >> >> >> >> St.Ack
> >>>>> >> > > >> >> >> >>
> >>>>> >> > > >> >> >> >> On Thu, Feb
10, 2011 at 2:41 PM, Ted Yu
> >>>>> >> > > >> >> >> >> <yuzhihong@gmail.com
> >>>>> >> > >
> >>>>> >> > > >> wrote:
> >>>>> >> > > >> >> >> >> > I replaced
hbase jar with hbase-0.90.1.jar
> >>>>> >> > > >> >> >> >> > I also
upgraded client side jar to
> >>>>> hbase-0.90.1.jar
> >>>>> >> > > >> >> >> >> >
> >>>>> >> > > >> >> >> >> > Our
map tasks were running faster than before
> for
> >>>>> about
> >>>>> >> > > >> >> >> >> > 50
> >>>>> >> > > >> minutes.
> >>>>> >> > > >> >> >> >> However,
> >>>>> >> > > >> >> >> >> > map
tasks then timed out calling flushCommits().
> >>>>> This
> >>>>> >> > > happened
> >>>>> >> > > >> even
> >>>>> >> > > >> >> >> after
> >>>>> >> > > >> >> >> >> > fresh
restart of hbase.
> >>>>> >> > > >> >> >> >> >
> >>>>> >> > > >> >> >> >> > I don't
see any exception in region server logs.
> >>>>> >> > > >> >> >> >> >
> >>>>> >> > > >> >> >> >> > In master
log, I found:
> >>>>> >> > > >> >> >> >> >
> >>>>> >> > > >> >> >> >> > 2011-02-10
18:24:15,286 DEBUG
> >>>>> >> > > >> >> >> >> >
> >>>>> >> > > >> >> >> >> >
> >>>>> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler:
> >>>>> >> > > >> Opened
> >>>>> >> > > >> >> >> region
> >>>>> >> > > >> >> >> >> > -ROOT-,,0.70236052
on sjc1-hadoop6.X.com
> >>>>> >> > ,60020,1297362251595
> >>>>> >> > > >> >> >> >> > 2011-02-10
18:24:15,349 INFO
> >>>>> >> > > >> >> >> >> org.apache.hadoop.hbase.catalog.CatalogTracker:
> >>>>> >> > > >> >> >> >> > Failed
verification of .META.,,1 at
> address=null;
> >>>>> >> > > >> >> >> >> >
> org.apache.hadoop.hbase.NotServingRegionException:
> >>>>> >> > > >> >> >> >> >
> org.apache.hadoop.hbase.NotServingRegionException:
> >>>>> >> > > >> >> >> >> > Region
> >>>>> >> > is
> >>>>> >> > > not
> >>>>> >> > > >> >> >> online:
> >>>>> >> > > >> >> >> >> > .META.,,1
> >>>>> >> > > >> >> >> >> > 2011-02-10
18:24:15,350 DEBUG
> >>>>> >> > > >> >> >> org.apache.hadoop.hbase.zookeeper.ZKAssign:
> >>>>> >> > > >> >> >> >> > master:60000-0x12e10d0e31e0000
Creating (or
> >>>>> updating)
> >>>>> >> > > unassigned
> >>>>> >> > > >> >> node
> >>>>> >> > > >> >> >> for
> >>>>> >> > > >> >> >> >> > 1028785192
with OFFLINE state
> >>>>> >> > > >> >> >> >> >
> >>>>> >> > > >> >> >> >> > I am
attaching region server (which didn't
> respond
> >>>>> to
> >>>>> >> > > >> >> stop-hbase.sh)
> >>>>> >> > > >> >> >> >> jstack.
> >>>>> >> > > >> >> >> >> >
> >>>>> >> > > >> >> >> >> > FYI
> >>>>> >> > > >> >> >> >> >
> >>>>> >> > > >> >> >> >> > On Thu,
Feb 10, 2011 at 10:10 AM, Stack
> >>>>> >> > > >> >> >> >> > <stack@duboce.net>
> >>>>> >> > > >> wrote:
> >>>>> >> > > >> >> >> >> >>
> >>>>> >> > > >> >> >> >> >>
Thats probably enough Ted.  The 0.90.1
> >>>>> >> > > >> >> >> >> >>
hbase-default.xml
> >>>>> >> > has
> >>>>> >> > > an
> >>>>> >> > > >> >> extra
> >>>>> >> > > >> >> >> >> >>
config. to enable the experimental HBASE-3455
> >>>>> feature
> >>>>> >> > > >> >> >> >> >>
but
> >>>>> >> > > you
> >>>>> >> > > >> can
> >>>>> >> > > >> >> >> copy
> >>>>> >> > > >> >> >> >> >>
that over if you want to try playing with it
> (it
> >>>>> >> > > >> >> >> >> >>
defaults
> >>>>> >> > > off
> >>>>> >> > > >> so
> >>>>> >> > > >> >> >> you'd
> >>>>> >> > > >> >> >> >> >>
copy over the config. if you wanted to set it
> to
> >>>>> true).
> >>>>> >> > > >> >> >> >> >>
> >>>>> >> > > >> >> >> >> >>
St.Ack
> >>>>> >> > > >> >> >> >> >
> >>>>> >> > > >> >> >> >> >
> >>>>> >> > > >> >> >> >>
> >>>>> >> > > >> >> >> >
> >>>>> >> > > >> >> >>
> >>>>> >> > > >> >> >
> >>>>> >> > > >> >>
> >>>>> >> > > >> >
> >>>>> >> > > >>
> >>>>> >> > > >
> >>>>> >> > >
> >>>>> >> >
> >>>>> >>
> >>>>> >>
> >>>>> >>
> >>>>> >> --
> >>>>> >> Todd Lipcon
> >>>>> >> Software Engineer, Cloudera
> >>>>> >
> >>>>> >
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message