hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ted Yu <yuzhih...@gmail.com>
Subject Re: initial experience with HBase 0.90.1 rc0
Date Mon, 14 Feb 2011 17:20:00 GMT
I disabled MSLAB.
My flow still couldn't make much progress.


> In this region server stack trace, I don't see
> MemStoreFlusher.reclaimMemStoreMemory() call:
> http://pastebin.com/uiBRidUa
>
>
> On Sun, Feb 13, 2011 at 1:14 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>
>> I am using hadoop-core-0.20.2-322.jar downloaded from Ryan's repo.
>> FYI
>>
>>
>> On Sun, Feb 13, 2011 at 1:12 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>>
>>> Since master server shut down, I restarted the cluster.
>>> The next flow over 200GB data got timed out.
>>>
>>> Here are some region server stat:
>>>
>>> request=0.0, regions=95, stores=213, storefiles=65,
>>> storefileIndexSize=99, memstoreSize=1311, compactionQueueSize=0,
>>> flushQueueSize=0, usedHeap=2532, maxHeap=3983, blockCacheSize=6853968,
>>> blockCacheFree=828520304, blockCacheCount=0, blockCacheHitCount=0,
>>> blockCacheMissCount=0, blockCacheEvictedCount=0, blockCacheHitRatio=0,
>>> blockCacheHitCachingRatio=0
>>>
>>> request=0.0, regions=95, stores=232, storefiles=72,
>>> storefileIndexSize=120, memstoreSize=301, compactionQueueSize=0,
>>> flushQueueSize=0, usedHeap=1740, maxHeap=3983, blockCacheSize=13110928,
>>> blockCacheFree=822263344, blockCacheCount=712, blockCacheHitCount=112478,
>>> blockCacheMissCount=712, blockCacheEvictedCount=0, blockCacheHitRatio=99,
>>> blockCacheHitCachingRatio=99
>>>
>>> Thanks
>>>
>>>
>>> On Sun, Feb 13, 2011 at 12:24 PM, Ryan Rawson <ryanobjc@gmail.com>wrote:
>>>
>>>> every handler thread, and every reader and also the accept thread are
>>>> all blocked on flushing memstore.  The handlers get blocked, then the
>>>> readers also have a finite handoff queue and they are blocked and also
>>>> the accept.
>>>>
>>>> But why isnt memstore flushing?  Do you have regionserver stats?  ie:
>>>> how much memstore global ram used?  That is found on the main page of
>>>> the regionserver http service, also found in ganglia/file stats.
>>>>
>>>> I havent looked at the logs yet, I'm off to lunch now.
>>>>
>>>> -ryan
>>>>
>>>> On Sun, Feb 13, 2011 at 8:44 AM, Ted Yu <yuzhihong@gmail.com> wrote:
>>>> > I had 3 consecutive successful runs processing 200GB data for each run
>>>> > before hitting timeout problem in the 4th run.
>>>> >
>>>> > The 5th run couldn't proceed because master complained:
>>>> >
>>>> > 2011-02-13 16:11:45,173 FATAL org.apache.hadoop.hbase.master.HMaster:
>>>> Failed
>>>> > assignment of regions to
>>>> > serverName=sjc1-hadoop6.sjc1.carrieriq.com,60020,1297518996557,
>>>> > load=(requests=0, regions=231, usedHeap=3535, maxHeap=3983)
>>>> >
>>>> > but sjc1-hadoop6.sjc1 claimed:
>>>> > 2011-02-13 16:13:32,258 DEBUG
>>>> > org.apache.hadoop.hbase.regionserver.HRegionServer: No master found,
>>>> will
>>>> > retry
>>>> >
>>>> > Here is stack trace for sjc1-hadoop6.sjc1:
>>>> > http://pastebin.com/X8zWLXqu
>>>> >
>>>> > I didn't have chance to capture master stack trace as master exited
>>>> after
>>>> > that.
>>>> >
>>>> > I also attach master and region server log on sjc1-hadoop6.sjc1 -
>>>> pardon me
>>>> > for including individual email addresses as attachments wouldn't go
>>>> through
>>>> > hbase.apache.org
>>>> >
>>>> > On Thu, Feb 10, 2011 at 5:05 PM, Todd Lipcon <todd@cloudera.com>
>>>> wrote:
>>>> >>
>>>> >> On Thu, Feb 10, 2011 at 4:54 PM, Ted Yu <yuzhihong@gmail.com>
wrote:
>>>> >>
>>>> >> > Thanks for the explanation.
>>>> >> > Assuming the mixed class loading is static, why did this situation
>>>> >> > develop
>>>> >> > after 40 minutes of heavy load :-(
>>>> >> >
>>>> >>
>>>> >> You didn't hit global memstore pressure until 40 minutes of load.
>>>> >>
>>>> >> -Todd
>>>> >>
>>>> >> On Thu, Feb 10, 2011 at 4:42 PM, Ryan Rawson <ryanobjc@gmail.com>
>>>> wrote:
>>>> >> >
>>>> >> > > It's a standard linking issue, you get one class from
one version
>>>> >> > > another from another, they are mostly compatible in terms
of
>>>> >> > > signatures (hence no exceptions) but are subtly incompatible
in
>>>> >> > > different ways. In the stack trace you posted, the handlers
were
>>>> >> > > blocked in:
>>>> >> > >
>>>> >> > >        at
>>>> >> > >
>>>> >> >
>>>> >> >
>>>> org.apache.hadoop.hbase.regionserver.MemStoreFlusher.reclaimMemStoreMemory(MemStoreFlusher.java:382)
>>>> >> > >
>>>> >> > > and the thread:
>>>> >> > >
>>>> >> > > "regionserver60020.cacheFlusher" daemon prio=10
>>>> tid=0x00002aaabc21e000
>>>> >> > > nid=0x7717 waiting for monitor entry [0x0000000000000000]
>>>> >> > >   java.lang.Thread.State: BLOCKED (on object monitor)
>>>> >> > >
>>>> >> > > was idle.
>>>> >> > >
>>>> >> > > The cache flusher thread should be flushing, and yet it's
doing
>>>> >> > > nothing.  This also happens to be one of the classes that
were
>>>> >> > > changed.
>>>> >> > >
>>>> >> > >
>>>> >> > >
>>>> >> > > On Thu, Feb 10, 2011 at 4:34 PM, Ted Yu <yuzhihong@gmail.com>
>>>> wrote:
>>>> >> > > > Can someone comment on my second question ?
>>>> >> > > > Thanks
>>>> >> > > >
>>>> >> > > > On Thu, Feb 10, 2011 at 4:25 PM, Ryan Rawson <
>>>> ryanobjc@gmail.com>
>>>> >> > wrote:
>>>> >> > > >
>>>> >> > > >> As I suspected.
>>>> >> > > >>
>>>> >> > > >> It's a byproduct of our maven assembly process.
The process
>>>> could
>>>> >> > > >> be
>>>> >> > > >> fixed. I wouldn't mind. I don't support runtime
checking of
>>>> jars,
>>>> >> > > >> there is such thing as too much tests, and this
is an example
>>>> of
>>>> >> > > >> it.
>>>> >> > > >> The check would then need a test, etc, etc.
>>>> >> > > >>
>>>> >> > > >> At SU we use new directories for each upgrade,
copying the
>>>> config
>>>> >> > > >> over. With the lack of -default.xml this is easier
than ever
>>>> (just
>>>> >> > > >> copy everything in conf/).  With symlink switchover
it makes
>>>> roll
>>>> >> > > >> forward/back as simple as doing a symlink switchover
or back.
>>>> I
>>>> >> > > >> have
>>>> >> > > >> to recommend this to everyone who doesnt have
a management
>>>> scheme.
>>>> >> > > >>
>>>> >> > > >> On Thu, Feb 10, 2011 at 4:20 PM, Ted Yu <yuzhihong@gmail.com>
>>>> >> > > >> wrote:
>>>> >> > > >> > hbase/hbase-0.90.1.jar leads lib/hbase-0.90.0.jar
in the
>>>> >> > > >> > classpath.
>>>> >> > > >> > I wonder
>>>> >> > > >> > 1. why hbase jar is placed in two directories
- 0.20.6
>>>> didn't use
>>>> >> > such
>>>> >> > > >> > structure
>>>> >> > > >> > 2. what from lib/hbase-0.90.0.jar could
be picked up and why
>>>> >> > > >> > there
>>>> >> > > wasn't
>>>> >> > > >> > exception in server log
>>>> >> > > >> >
>>>> >> > > >> > I think a JIRA should be filed for item
2 above - bail out
>>>> when
>>>> >> > > >> > the
>>>> >> > > two
>>>> >> > > >> > hbase jars from $HBASE_HOME and $HBASE_HOME/lib
are of
>>>> different
>>>> >> > > >> versions.
>>>> >> > > >> >
>>>> >> > > >> > Cheers
>>>> >> > > >> >
>>>> >> > > >> > On Thu, Feb 10, 2011 at 3:40 PM, Ryan Rawson
<
>>>> ryanobjc@gmail.com>
>>>> >> > > wrote:
>>>> >> > > >> >
>>>> >> > > >> >> What do you get when you:
>>>> >> > > >> >>
>>>> >> > > >> >> ls lib/hbase*
>>>> >> > > >> >>
>>>> >> > > >> >> I'm going to guess there is hbase-0.90.0.jar
there
>>>> >> > > >> >>
>>>> >> > > >> >>
>>>> >> > > >> >>
>>>> >> > > >> >> On Thu, Feb 10, 2011 at 3:25 PM, Ted
Yu <
>>>> yuzhihong@gmail.com>
>>>> >> > wrote:
>>>> >> > > >> >> > hbase-0.90.0-tests.jar and hbase-0.90.1.jar
co-exist
>>>> >> > > >> >> > Would this be a problem ?
>>>> >> > > >> >> >
>>>> >> > > >> >> > On Thu, Feb 10, 2011 at 3:16 PM,
Ryan Rawson
>>>> >> > > >> >> > <ryanobjc@gmail.com
>>>> >> > >
>>>> >> > > >> wrote:
>>>> >> > > >> >> >
>>>> >> > > >> >> >> You don't have both the old
and the new hbase jars in
>>>> there
>>>> >> > > >> >> >> do
>>>> >> > > you?
>>>> >> > > >> >> >>
>>>> >> > > >> >> >> -ryan
>>>> >> > > >> >> >>
>>>> >> > > >> >> >> On Thu, Feb 10, 2011 at 3:12
PM, Ted Yu <
>>>> yuzhihong@gmail.com>
>>>> >> > > wrote:
>>>> >> > > >> >> >> > .META. went offline during
second flow attempt.
>>>> >> > > >> >> >> >
>>>> >> > > >> >> >> > The time out I mentioned
happened for 1st and 3rd
>>>> attempts.
>>>> >> > > HBase
>>>> >> > > >> was
>>>> >> > > >> >> >> > restarted before the 1st
and 3rd attempts.
>>>> >> > > >> >> >> >
>>>> >> > > >> >> >> > Here is jstack:
>>>> >> > > >> >> >> > http://pastebin.com/EHMSvsRt
>>>> >> > > >> >> >> >
>>>> >> > > >> >> >> > On Thu, Feb 10, 2011 at
3:04 PM, Stack <
>>>> stack@duboce.net>
>>>> >> > > wrote:
>>>> >> > > >> >> >> >
>>>> >> > > >> >> >> >> So, .META. is not
online?  What happens if you use
>>>> shell
>>>> >> > > >> >> >> >> at
>>>> >> > > this
>>>> >> > > >> >> time.
>>>> >> > > >> >> >> >>
>>>> >> > > >> >> >> >> Your attachement did
not come across Ted.  Mind
>>>> >> > > >> >> >> >> postbin'ing
>>>> >> > it?
>>>> >> > > >> >> >> >>
>>>> >> > > >> >> >> >> St.Ack
>>>> >> > > >> >> >> >>
>>>> >> > > >> >> >> >> On Thu, Feb 10, 2011
at 2:41 PM, Ted Yu
>>>> >> > > >> >> >> >> <yuzhihong@gmail.com
>>>> >> > >
>>>> >> > > >> wrote:
>>>> >> > > >> >> >> >> > I replaced hbase
jar with hbase-0.90.1.jar
>>>> >> > > >> >> >> >> > I also upgraded
client side jar to hbase-0.90.1.jar
>>>> >> > > >> >> >> >> >
>>>> >> > > >> >> >> >> > Our map tasks
were running faster than before for
>>>> about
>>>> >> > > >> >> >> >> > 50
>>>> >> > > >> minutes.
>>>> >> > > >> >> >> >> However,
>>>> >> > > >> >> >> >> > map tasks then
timed out calling flushCommits().
>>>> This
>>>> >> > > happened
>>>> >> > > >> even
>>>> >> > > >> >> >> after
>>>> >> > > >> >> >> >> > fresh restart
of hbase.
>>>> >> > > >> >> >> >> >
>>>> >> > > >> >> >> >> > I don't see any
exception in region server logs.
>>>> >> > > >> >> >> >> >
>>>> >> > > >> >> >> >> > In master log,
I found:
>>>> >> > > >> >> >> >> >
>>>> >> > > >> >> >> >> > 2011-02-10 18:24:15,286
DEBUG
>>>> >> > > >> >> >> >> >
>>>> >> > > >> >> >> >> >
>>>> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler:
>>>> >> > > >> Opened
>>>> >> > > >> >> >> region
>>>> >> > > >> >> >> >> > -ROOT-,,0.70236052
on sjc1-hadoop6.X.com
>>>> >> > ,60020,1297362251595
>>>> >> > > >> >> >> >> > 2011-02-10 18:24:15,349
INFO
>>>> >> > > >> >> >> >> org.apache.hadoop.hbase.catalog.CatalogTracker:
>>>> >> > > >> >> >> >> > Failed verification
of .META.,,1 at address=null;
>>>> >> > > >> >> >> >> > org.apache.hadoop.hbase.NotServingRegionException:
>>>> >> > > >> >> >> >> > org.apache.hadoop.hbase.NotServingRegionException:
>>>> >> > > >> >> >> >> > Region
>>>> >> > is
>>>> >> > > not
>>>> >> > > >> >> >> online:
>>>> >> > > >> >> >> >> > .META.,,1
>>>> >> > > >> >> >> >> > 2011-02-10 18:24:15,350
DEBUG
>>>> >> > > >> >> >> org.apache.hadoop.hbase.zookeeper.ZKAssign:
>>>> >> > > >> >> >> >> > master:60000-0x12e10d0e31e0000
Creating (or
>>>> updating)
>>>> >> > > unassigned
>>>> >> > > >> >> node
>>>> >> > > >> >> >> for
>>>> >> > > >> >> >> >> > 1028785192 with
OFFLINE state
>>>> >> > > >> >> >> >> >
>>>> >> > > >> >> >> >> > I am attaching
region server (which didn't respond
>>>> to
>>>> >> > > >> >> stop-hbase.sh)
>>>> >> > > >> >> >> >> jstack.
>>>> >> > > >> >> >> >> >
>>>> >> > > >> >> >> >> > FYI
>>>> >> > > >> >> >> >> >
>>>> >> > > >> >> >> >> > On Thu, Feb 10,
2011 at 10:10 AM, Stack
>>>> >> > > >> >> >> >> > <stack@duboce.net>
>>>> >> > > >> wrote:
>>>> >> > > >> >> >> >> >>
>>>> >> > > >> >> >> >> >> Thats probably
enough Ted.  The 0.90.1
>>>> >> > > >> >> >> >> >> hbase-default.xml
>>>> >> > has
>>>> >> > > an
>>>> >> > > >> >> extra
>>>> >> > > >> >> >> >> >> config. to
enable the experimental HBASE-3455
>>>> feature
>>>> >> > > >> >> >> >> >> but
>>>> >> > > you
>>>> >> > > >> can
>>>> >> > > >> >> >> copy
>>>> >> > > >> >> >> >> >> that over
if you want to try playing with it (it
>>>> >> > > >> >> >> >> >> defaults
>>>> >> > > off
>>>> >> > > >> so
>>>> >> > > >> >> >> you'd
>>>> >> > > >> >> >> >> >> copy over
the config. if you wanted to set it to
>>>> true).
>>>> >> > > >> >> >> >> >>
>>>> >> > > >> >> >> >> >> St.Ack
>>>> >> > > >> >> >> >> >
>>>> >> > > >> >> >> >> >
>>>> >> > > >> >> >> >>
>>>> >> > > >> >> >> >
>>>> >> > > >> >> >>
>>>> >> > > >> >> >
>>>> >> > > >> >>
>>>> >> > > >> >
>>>> >> > > >>
>>>> >> > > >
>>>> >> > >
>>>> >> >
>>>> >>
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Todd Lipcon
>>>> >> Software Engineer, Cloudera
>>>> >
>>>> >
>>>>
>>>
>>>
>>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message