hbase-dev mailing list archives

From: Todd Lipcon <t...@cloudera.com>
Subject: Re: MemStoreFlusher and my initial experience with HBase 0.90.1 rc0
Date: Tue, 15 Feb 2011 02:27:05 GMT
On Mon, Feb 14, 2011 at 6:23 PM, Ted Yu <yuzhihong@gmail.com> wrote:

> Looks like the region server came down after I left the office.
> Still, the second flow wasn't able to proceed writing to hbase. I hope such
> a stress test can pass.
>
>
Does the same stress test pass on 0.90.0?

I don't think failures should block 0.90.1 unless the same test passes on
0.90.0. I.e., so long as 0.90.1 is better than our previous release, we should
release it, and then get to work on 0.90.2 :)

-Todd


>
> On Mon, Feb 14, 2011 at 5:23 PM, Todd Lipcon <todd@cloudera.com> wrote:
>
> > Looks like it's just in the process of flushing something:
> >
> > "regionserver60020.cacheFlusher" daemon prio=10 tid=0x0000000050ce3000
> > nid=0x32a9 runnable [0x0000000043138000]
> >    java.lang.Thread.State: RUNNABLE
> >         at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:797)
> >         at org.apache.hadoop.hbase.regionserver.Store.internalFlushCache(Store.java:479)
> >         - locked <0x000000072aecd6b8> (a java.lang.Object)
> >         at org.apache.hadoop.hbase.regionserver.Store.flushCache(Store.java:448)
> >         at org.apache.hadoop.hbase.regionserver.Store.access$100(Store.java:81)
> >         at org.apache.hadoop.hbase.regionserver.Store$StoreFlusherImpl.flushCache(Store.java:1508)
> >         at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:967)
> >         at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:894)
> >         at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:846)
> >         at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:386)
> >         at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushOneForGlobalPressure(MemStoreFlusher.java:194)
> >         at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.run(MemStoreFlusher.java:214)
> >
> > I.e., this is expected behavior. Does it not go down eventually?
> >
> > On Mon, Feb 14, 2011 at 5:17 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> >
> > > I applied Todd's patch from HBASE-3531 locally and built the hbase jar.
> > >
> > > I restarted the cluster. The first flow over 200GB of data succeeded.
> > > The second flow got stuck, so I tried to shut down the cluster.
> > >
> > > Here is the stack trace for one of the region servers that refused to go down:
> > > http://pastebin.com/yJQhhYp8
> > >
> > > This is particularly interesting (other threads were waiting to lock
> > > 0x000000070be8a028):
> > >
> > > "IPC Server handler 5 on 60020" daemon prio=10 tid=0x00002aaab883c000
> > > nid=0x32bb waiting on condition [0x0000000043f46000]
> > >    java.lang.Thread.State: WAITING (parking)
> > >         at sun.misc.Unsafe.park(Native Method)
> > >         - parking to wait for <0x000000070ca468a8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
> > >         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
> > >         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
> > >         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:842)
> > >         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1178)
> > >         at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:186)
> > >         at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:262)
> > >         at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.reclaimMemStoreMemory(MemStoreFlusher.java:429)
> > >         - locked <0x000000070be8a028> (a org.apache.hadoop.hbase.regionserver.MemStoreFlusher)
> > >         at org.apache.hadoop.hbase.regionserver.HRegionServer.multi(HRegionServer.java:2562)
> > >
> > > Thanks for your attention.
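Reading the two traces together: an IPC handler enters the synchronized reclaimMemStoreMemory(), then parks on a ReentrantLock that the flusher thread holds for the duration of a long flush, so every other handler queues up behind the object monitor. A minimal sketch of that shape (illustrative names only, not HBase source):

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustrative only -- not HBase code. A synchronized entry point that then
// waits on a lock held by a long-running worker serializes every caller
// behind one object monitor, which is the pile-up seen in the traces above.
class FlusherPileUpSketch {
    final ReentrantLock flushLock = new ReentrantLock();

    // Many IPC handler threads funnel through here; only one at a time can
    // hold the monitor, and that one then parks on flushLock.
    synchronized void reclaimMemStoreMemory() {
        flushLock.lock();      // parks here while the flusher thread holds
        try {                  // flushLock during a long flush; meanwhile
            // wait for space  // all other handlers block on the monitor
        } finally {
            flushLock.unlock();
        }
    }
}
```

With a single caller and an uncontended lock the method returns immediately; the pile-up only appears when the lock holder is slow, exactly as in the dumps above.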
> > >
> > > On Mon, Feb 14, 2011 at 1:12 PM, Todd Lipcon <todd@cloudera.com> wrote:
> > >
> > > > Gotcha, OK. I think I understand what's going on then.
> > > > I'm guessing you have a TON of log messages like this?
> > > >
> > > > 2011-02-14 06:27:23,116 DEBUG org.apache.hadoop.hbase.regionserver.HRegion:
> > > > NOT flushing memstore for region NIGHTLYDEVGRIDSGRIDSQL-THREEGPPSPEECHCALLS-1297661454931,E\xC2\xC8\xBC\xD4\xA5\x04E\xC2\xC8\xBC\xD4\xA5\x04E\xC2\xC8\xBC\xD4\xA5\x04E\xC2\xC8\xBC\xD4\xA5\x04E\xC2\xC8\xAA,1297662719973.cd06c5f9e4a0ffcc16cb6a5e559cd5b3.,
> > > > flushing=false, writesEnabled=false
> > > >
> > > > -Todd
> > > >
> > > > On Mon, Feb 14, 2011 at 1:08 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > >
> > > > > The lock up happened before I shut down the region server.
> > > > > I had to get out of that situation so that I could continue 0.90.1
> > > > > validation.
> > > > >
> > > > > On Mon, Feb 14, 2011 at 12:59 PM, Todd Lipcon <todd@cloudera.com> wrote:
> > > > >
> > > > > > Hi Ted,
> > > > > >
> > > > > > What was the reason for the region server shutdown? I.e., was it
> > > > > > aborting itself, or did you send it a kill signal, or what?
> > > > > >
> > > > > > Still trying to understand why this happened.
> > > > > >
> > > > > > -Todd
> > > > > >
> > > > > > On Mon, Feb 14, 2011 at 10:40 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > > > >
> > > > > > > Here is the stack trace for one region server which didn't shut down cleanly:
> > > > > > > http://pastebin.com/PEtEdi4g
> > > > > > >
> > > > > > > I noticed IPC Server handler 7 was holding a lock in reclaimMemStoreMemory().
> > > > > > >
> > > > > > > Here is a related snippet from the region server's log:
> > > > > > > http://pastebin.com/KeXppURX
> > > > > > >
> > > > > > > I noticed region 1297662719973.cd06c5f9e4a0ffcc16cb6a5e559cd5b3 was splitting.
> > > > > > >
> > > > > > > Please advise whether there could be any relation between the above
> > > > > > > snippet and the lock up in MemStoreFlusher.
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > On Mon, Feb 14, 2011 at 9:20 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > > > > >
> > > > > > > > I disabled MSLAB.
> > > > > > > > My flow still couldn't make much progress.
> > > > > > > >
> > > > > > > >> In this region server stack trace, I don't see a
> > > > > > > >> MemStoreFlusher.reclaimMemStoreMemory() call:
> > > > > > > >> http://pastebin.com/uiBRidUa
> > > > > > > >>
> > > > > > > >> On Sun, Feb 13, 2011 at 1:14 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > > > > > >>
> > > > > > > >>> I am using hadoop-core-0.20.2-322.jar downloaded from Ryan's repo.
> > > > > > > >>> FYI
> > > > > > > >>>
> > > > > > > >>> On Sun, Feb 13, 2011 at 1:12 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > > > > > >>>
> > > > > > > >>>> Since the master server shut down, I restarted the cluster.
> > > > > > > >>>> The next flow over 200GB of data timed out.
> > > > > > > >>>>
> > > > > > > >>>> Here are some region server stats:
> > > > > > > >>>>
> > > > > > > >>>> request=0.0, regions=95, stores=213, storefiles=65,
> > > > > > > >>>> storefileIndexSize=99, memstoreSize=1311, compactionQueueSize=0,
> > > > > > > >>>> flushQueueSize=0, usedHeap=2532, maxHeap=3983, blockCacheSize=6853968,
> > > > > > > >>>> blockCacheFree=828520304, blockCacheCount=0, blockCacheHitCount=0,
> > > > > > > >>>> blockCacheMissCount=0, blockCacheEvictedCount=0, blockCacheHitRatio=0,
> > > > > > > >>>> blockCacheHitCachingRatio=0
> > > > > > > >>>>
> > > > > > > >>>> request=0.0, regions=95, stores=232, storefiles=72,
> > > > > > > >>>> storefileIndexSize=120, memstoreSize=301, compactionQueueSize=0,
> > > > > > > >>>> flushQueueSize=0, usedHeap=1740, maxHeap=3983, blockCacheSize=13110928,
> > > > > > > >>>> blockCacheFree=822263344, blockCacheCount=712, blockCacheHitCount=112478,
> > > > > > > >>>> blockCacheMissCount=712, blockCacheEvictedCount=0, blockCacheHitRatio=99,
> > > > > > > >>>> blockCacheHitCachingRatio=99
> > > > > > > >>>>
> > > > > > > >>>> Thanks
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>> On Sun, Feb 13, 2011 at 12:24 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
> > > > > > > >>>>
> > > > > > > >>>>> Every handler thread, every reader, and also the accept thread are
> > > > > > > >>>>> all blocked on flushing memstore. The handlers get blocked, then the
> > > > > > > >>>>> readers (which have a finite handoff queue) are blocked, and then
> > > > > > > >>>>> the accept thread as well.
> > > > > > > >>>>>
> > > > > > > >>>>> But why isn't the memstore flushing? Do you have regionserver stats?
> > > > > > > >>>>> I.e., how much global memstore RAM is used? That is found on the
> > > > > > > >>>>> main page of the regionserver http service, and also in ganglia/file
> > > > > > > >>>>> stats.
> > > > > > > >>>>>
> > > > > > > >>>>> I haven't looked at the logs yet; I'm off to lunch now.
> > > > > > > >>>>>
> > > > > > > >>>>> -ryan
> > > > > > > >>>>>
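For reference, the global memstore thresholds Ryan is asking about can be estimated from the maxHeap figure in the stats above. The 0.40/0.35 factors used here are assumed to be the era's hbase.regionserver.global.memstore.upperLimit/lowerLimit defaults; verify them against your own hbase-default.xml:

```java
// Back-of-envelope arithmetic, not HBase code: estimate where global
// memstore pressure kicks in for the heap reported in the stats above.
public class MemstorePressureEstimate {
    public static void main(String[] args) {
        long maxHeapMb = 3983;       // maxHeap from the RS stats above
        double upperLimit = 0.40;    // assumed global memstore upper limit
        double lowerLimit = 0.35;    // assumed global memstore lower limit

        long blockWritesAboveMb = (long) (maxHeapMb * upperLimit);
        long drainDownToMb = (long) (maxHeapMb * lowerLimit);

        // Updates block above ~1593 MB; forced flushes drain to ~1394 MB.
        System.out.println("writes blocked above ~" + blockWritesAboveMb + " MB");
        System.out.println("forced flushes drain to ~" + drainDownToMb + " MB");
    }
}
```

Against the first stats block above (memstoreSize=1311, maxHeap=3983), the server would be close to, but not yet over, the assumed blocking threshold.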
> > > > > > > >>>>> On Sun, Feb 13, 2011 at 8:44 AM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > > > > > >>>>> > I had 3 consecutive successful runs processing 200GB of data for
> > > > > > > >>>>> > each run before hitting the timeout problem in the 4th run.
> > > > > > > >>>>> >
> > > > > > > >>>>> > The 5th run couldn't proceed because the master complained:
> > > > > > > >>>>> >
> > > > > > > >>>>> > 2011-02-13 16:11:45,173 FATAL org.apache.hadoop.hbase.master.HMaster: Failed
> > > > > > > >>>>> > assignment of regions to serverName=sjc1-hadoop6.sjc1.carrieriq.com,60020,1297518996557,
> > > > > > > >>>>> > load=(requests=0, regions=231, usedHeap=3535, maxHeap=3983)
> > > > > > > >>>>> >
> > > > > > > >>>>> > but sjc1-hadoop6.sjc1 claimed:
> > > > > > > >>>>> > 2011-02-13 16:13:32,258 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer:
> > > > > > > >>>>> > No master found, will retry
> > > > > > > >>>>> >
> > > > > > > >>>>> > Here is the stack trace for sjc1-hadoop6.sjc1:
> > > > > > > >>>>> > http://pastebin.com/X8zWLXqu
> > > > > > > >>>>> >
> > > > > > > >>>>> > I didn't have a chance to capture the master stack trace, as the
> > > > > > > >>>>> > master exited after that.
> > > > > > > >>>>> >
> > > > > > > >>>>> > I also attach the master and region server log on sjc1-hadoop6.sjc1 -
> > > > > > > >>>>> > pardon me for including individual email addresses, as attachments
> > > > > > > >>>>> > wouldn't go through hbase.apache.org
> > > > > > > >>>>> >
> > > > > > > >>>>> > On Thu, Feb 10, 2011 at 5:05 PM, Todd Lipcon <todd@cloudera.com> wrote:
> > > > > > > >>>>> >>
> > > > > > > >>>>> >> On Thu, Feb 10, 2011 at 4:54 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > > > > > >>>>> >>
> > > > > > > >>>>> >> > Thanks for the explanation.
> > > > > > > >>>>> >> > Assuming the mixed class loading is static, why did this situation
> > > > > > > >>>>> >> > develop after 40 minutes of heavy load :-(
> > > > > > > >>>>> >>
> > > > > > > >>>>> >> You didn't hit global memstore pressure until 40 minutes of load.
> > > > > > > >>>>> >>
> > > > > > > >>>>> >> -Todd
> > > > > > > >>>>> >>
> > > > > > > >>>>> >> On Thu, Feb 10, 2011 at 4:42 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
> > > > > > > >>>>> >> >
> > > > > > > >>>>> >> > > It's a standard linking issue: you get one class from one version
> > > > > > > >>>>> >> > > and another from another; they are mostly compatible in terms of
> > > > > > > >>>>> >> > > signatures (hence no exceptions) but are subtly incompatible in
> > > > > > > >>>>> >> > > different ways. In the stack trace you posted, the handlers were
> > > > > > > >>>>> >> > > blocked in:
> > > > > > > >>>>> >> > >
> > > > > > > >>>>> >> > >        at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.reclaimMemStoreMemory(MemStoreFlusher.java:382)
> > > > > > > >>>>> >> > >
> > > > > > > >>>>> >> > > and the thread:
> > > > > > > >>>>> >> > >
> > > > > > > >>>>> >> > > "regionserver60020.cacheFlusher" daemon prio=10 tid=0x00002aaabc21e000
> > > > > > > >>>>> >> > > nid=0x7717 waiting for monitor entry [0x0000000000000000]
> > > > > > > >>>>> >> > >   java.lang.Thread.State: BLOCKED (on object monitor)
> > > > > > > >>>>> >> > >
> > > > > > > >>>>> >> > > was idle.
> > > > > > > >>>>> >> > >
> > > > > > > >>>>> >> > > The cache flusher thread should be flushing, and yet it's doing
> > > > > > > >>>>> >> > > nothing. This also happens to be one of the classes that were changed.
> > > > > > > >>>>> >> > >
> > > > > > > >>>>> >> > >
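One generic JVM technique for spotting the mixed-jar condition described above is to ask the classloader for every copy of a class that should exist exactly once on the classpath; getting more than one URL back means two hbase jars are being linked together. A sketch (this is a stock ClassLoader.getResources() trick, not actual HBase startup code):

```java
import java.net.URL;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

// Sketch: enumerate every classpath location that provides a given class
// resource. More than one hit for an HBase class means two hbase jars are
// both on the classpath -- the mixed-version situation described above.
class DuplicateClassCheck {
    static List<URL> findCopies(String classResource) throws Exception {
        Enumeration<URL> urls =
            DuplicateClassCheck.class.getClassLoader().getResources(classResource);
        List<URL> copies = new ArrayList<>();
        while (urls.hasMoreElements()) {
            copies.add(urls.nextElement());
        }
        return copies;
    }

    public static void main(String[] args) throws Exception {
        // Class name taken from the traces above.
        List<URL> copies =
            findCopies("org/apache/hadoop/hbase/regionserver/MemStoreFlusher.class");
        if (copies.size() > 1) {
            System.err.println("Multiple hbase jars on classpath: " + copies);
        }
    }
}
```

On a clean install this prints nothing; with both hbase-0.90.0.jar and hbase-0.90.1.jar present it would list both jar URLs.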
> > > > > > > >>>>> >> > > On Thu, Feb 10, 2011 at 4:34 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > > > > > >>>>> >> > >
> > > > > > > >>>>> >> > > > Can someone comment on my second question?
> > > > > > > >>>>> >> > > > Thanks
> > > > > > > >>>>> >> > > >
> > > > > > > >>>>> >> > > > On Thu, Feb 10, 2011 at 4:25 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
> > > > > > > >>>>> >> > > >
> > > > > > > >>>>> >> > > >> As I suspected.
> > > > > > > >>>>> >> > > >>
> > > > > > > >>>>> >> > > >> It's a byproduct of our maven assembly process. The process could be
> > > > > > > >>>>> >> > > >> fixed; I wouldn't mind. I don't support runtime checking of jars;
> > > > > > > >>>>> >> > > >> there is such a thing as too many tests, and this is an example of
> > > > > > > >>>>> >> > > >> it. The check would then need a test, etc., etc.
> > > > > > > >>>>> >> > > >>
> > > > > > > >>>>> >> > > >> At SU we use new directories for each upgrade, copying the config
> > > > > > > >>>>> >> > > >> over. With the lack of -default.xml this is easier than ever (just
> > > > > > > >>>>> >> > > >> copy everything in conf/). With symlink switchover, rolling forward
> > > > > > > >>>>> >> > > >> or back is as simple as switching the symlink over or back. I have
> > > > > > > >>>>> >> > > >> to recommend this to everyone who doesn't have a management scheme.
> > > > > > > >>>>> >> > > >>
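The symlink switchover Ryan recommends can be sketched with java.nio (paths are illustrative; in practice this is typically just an `ln -sfn` in a deploy script):

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of symlink-based roll forward/back: each release lives in its own
// directory and a single "current" link is repointed. Illustrative only --
// not a supported HBase tool.
class SymlinkSwitchover {
    static void pointCurrentAt(Path current, Path release) throws Exception {
        Files.deleteIfExists(current);               // drop the old link
        Files.createSymbolicLink(current, release);  // roll to the release
    }
}
```

Rolling back is then just pointing "current" at the previous release directory again; configs are copied into each release directory at install time, as described above.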
> > > > > > > >>>>> >> > > >> On Thu, Feb 10, 2011 at 4:20 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > > > > > >>>>> >> > > >> > hbase/hbase-0.90.1.jar leads lib/hbase-0.90.0.jar in the classpath.
> > > > > > > >>>>> >> > > >> > I wonder
> > > > > > > >>>>> >> > > >> > 1. why the hbase jar is placed in two directories - 0.20.6 didn't
> > > > > > > >>>>> >> > > >> > use such a structure
> > > > > > > >>>>> >> > > >> > 2. what from lib/hbase-0.90.0.jar could be picked up, and why
> > > > > > > >>>>> >> > > >> > there wasn't an exception in the server log
> > > > > > > >>>>> >> > > >> >
> > > > > > > >>>>> >> > > >> > I think a JIRA should be filed for item 2 above - bail out when
> > > > > > > >>>>> >> > > >> > the two hbase jars from $HBASE_HOME and $HBASE_HOME/lib are of
> > > > > > > >>>>> >> > > >> > different versions.
> > > > > > > >>>>> >> > > >> >
> > > > > > > >>>>> >> > > >> > Cheers
> > > > > > > >>>>> >> > > >> >
> > > > > > > >>>>> >> > > >> > On Thu, Feb 10, 2011 at 3:40 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
> > > > > > > >>>>> >> > > >> >
> > > > > > > >>>>> >> > > >> >> What do you get when you:
> > > > > > > >>>>> >> > > >> >>
> > > > > > > >>>>> >> > > >> >> ls lib/hbase*
> > > > > > > >>>>> >> > > >> >>
> > > > > > > >>>>> >> > > >> >> I'm going to guess there is an hbase-0.90.0.jar there.
> > > > > > > >>>>> >> > > >> >>
> > > > > > > >>>>> >> > > >> >> On Thu, Feb 10, 2011 at 3:25 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > > > > > >>>>> >> > > >> >> > hbase-0.90.0-tests.jar and hbase-0.90.1.jar co-exist.
> > > > > > > >>>>> >> > > >> >> > Would this be a problem?
> > > > > > > >>>>> >> > > >> >> >
> > > > > > > >>>>> >> > > >> >> > On Thu, Feb 10, 2011 at 3:16 PM, Ryan Rawson <ryanobjc@gmail.com> wrote:
> > > > > > > >>>>> >> > > >> >> >
> > > > > > > >>>>> >> > > >> >> >> You don't have both the old and the new hbase jars in there, do you?
> > > > > > > >>>>> >> > > >> >> >>
> > > > > > > >>>>> >> > > >> >> >> -ryan
> > > > > > > >>>>> >> > > >> >> >>
> > > > > > > >>>>> >> > > >> >> >> On Thu, Feb 10, 2011 at 3:12 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > > > > > >>>>> >> > > >> >> >> > .META. went offline during the second flow attempt.
> > > > > > > >>>>> >> > > >> >> >> >
> > > > > > > >>>>> >> > > >> >> >> > The time out I mentioned happened for the 1st and 3rd attempts.
> > > > > > > >>>>> >> > > >> >> >> > HBase was restarted before the 1st and 3rd attempts.
> > > > > > > >>>>> >> > > >> >> >> >
> > > > > > > >>>>> >> > > >> >> >> > Here is the jstack:
> > > > > > > >>>>> >> > > >> >> >> > http://pastebin.com/EHMSvsRt
> > > > > > > >>>>> >> > > >> >> >> >
> > > > > > > >>>>> >> > > >> >> >> > On Thu, Feb 10, 2011 at 3:04 PM, Stack <stack@duboce.net> wrote:
> > > > > > > >>>>> >> > > >> >> >> >> So, .META. is not online? What happens if you use the shell
> > > > > > > >>>>> >> > > >> >> >> >> at this time?
> > > > > > > >>>>> >> > > >> >> >> >>
> > > > > > > >>>>> >> > > >> >> >> >> Your attachment did not come across, Ted. Mind pastebin'ing it?
> > > > > > > >>>>> >> > > >> >> >> >>
> > > > > > > >>>>> >> > > >> >> >> >> St.Ack
> > > > > > > >>>>> >> > > >> >> >> >>
> > > > > > > >>>>> >> > > >> >> >> >> On Thu, Feb 10, 2011 at 2:41 PM, Ted Yu <yuzhihong@gmail.com> wrote:
> > > > > > > >>>>> >> > > >> >> >> >> > I replaced the hbase jar with hbase-0.90.1.jar.
> > > > > > > >>>>> >> > > >> >> >> >> > I also upgraded the client side jar to hbase-0.90.1.jar.
> > > > > > > >>>>> >> > > >> >> >> >> >
> > > > > > > >>>>> >> > > >> >> >> >> > Our map tasks were running faster than before for about 50
> > > > > > > >>>>> >> > > >> >> >> >> > minutes. However, map tasks then timed out calling
> > > > > > > >>>>> >> > > >> >> >> >> > flushCommits(). This happened even after a fresh restart of hbase.
> > > > > > > >>>>> >> > > >> >> >> >> >
> > > > > > > >>>>> >> > > >> >> >> >> > I don't see any exception in the region server logs.
> > > > > > > >>>>> >> > > >> >> >> >> >
> > > > > > > >>>>> >> > > >> >> >> >> > In the master log, I found:
> > > > > > > >>>>> >> > > >> >> >> >> >
> > > > > > > >>>>> >> > > >> >> >> >> > 2011-02-10 18:24:15,286 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler:
> > > > > > > >>>>> >> > > >> >> >> >> > Opened region -ROOT-,,0.70236052 on sjc1-hadoop6.X.com,60020,1297362251595
> > > > > > > >>>>> >> > > >> >> >> >> > 2011-02-10 18:24:15,349 INFO org.apache.hadoop.hbase.catalog.CatalogTracker:
> > > > > > > >>>>> >> > > >> >> >> >> > Failed verification of .META.,,1 at address=null;
> > > > > > > >>>>> >> > > >> >> >> >> > org.apache.hadoop.hbase.NotServingRegionException:
> > > > > > > >>>>> >> > > >> >> >> >> > org.apache.hadoop.hbase.NotServingRegionException: Region is not online:
> > > > > > > >>>>> >> > > >> >> >> >> > .META.,,1
> > > > > > > >>>>> >> > > >> >> >> >> > 2011-02-10 18:24:15,350 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > > > > >>>>> >> > > >> >> >> >> > master:60000-0x12e10d0e31e0000 Creating (or updating) unassigned
> > > > > > > >>>>> >> > > >> >> >> >> > node for 1028785192 with OFFLINE state
> > > > > > > >>>>> >> > > >> >> >> >> >
> > > > > > > >>>>> >> > > >> >> >> >> > I am attaching the jstack of a region server (which didn't
> > > > > > > >>>>> >> > > >> >> >> >> > respond to stop-hbase.sh).
> > > > > > > >>>>> >> > > >> >> >> >> >
> > > > > > > >>>>> >> > > >> >> >> >> > FYI
> > > > > > > >>>>> >> > > >> >> >> >> >
> > > > > > > >>>>> >> > > >> >> >> >> > On Thu, Feb 10, 2011 at 10:10 AM, Stack <stack@duboce.net> wrote:
> > > > > > > >>>>> >> > > >> >> >> >> >> That's probably enough, Ted. The 0.90.1 hbase-default.xml
> > > > > > > >>>>> >> > > >> >> >> >> >> has an extra config. to enable the experimental HBASE-3455
> > > > > > > >>>>> >> > > >> >> >> >> >> feature, but you can copy that over if you want to try
> > > > > > > >>>>> >> > > >> >> >> >> >> playing with it (it defaults off, so you'd copy over the
> > > > > > > >>>>> >> > > >> >> >> >> >> config. if you wanted to set it to true).
> > > > > > > >>>>> >> > > >> >> >> >> >>
> > > > > > > >>>>> >> > > >> >> >> >> >> St.Ack
> > > > > > > >>>>> >>
> > > > > > > >>>>> >> --
> > > > > > > >>>>> >> Todd Lipcon
> > > > > > > >>>>> >> Software Engineer, Cloudera
> > > > > > > >>>>> >
> > > > > > > >>>>> >
> > > > > > > >>>>>
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Todd Lipcon
> > > > > > Software Engineer, Cloudera
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Todd Lipcon
> > > > Software Engineer, Cloudera
> > > >
> > >
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera
