From: 陈加俊 (Chen Jiajun) <cjjvictory@gmail.com>
To: user@hbase.apache.org
Date: Wed, 9 Mar 2011 00:21:41 +0800
Subject: Re: region servers dying - flush request - YCSB

Had the table already been disabled when you hit Ctrl-C?

2011/3/8, M.Deniz OKTAR:
> Something new came up!
>
> I tried to truncate the 'usertable', which had ~12M entries.
>
> The shell sat at "disabling table" for a long time. The process was
> still there, but there were no requests, so I broke out of it with
> Ctrl-C.
>
> Then I tried count 'usertable' to see if the data remained; the shell
> gave an error, and one of the region servers logged the lines below.
>
> The master logs were similar (I tried to disable again, and the master
> log is from that trial).
>
>
> Regionserver 2:
>
> 2011-03-08 16:47:24,852 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> NotServingRegionException; Region is not online:
> usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
> 2011-03-08 16:47:27,765 DEBUG
> org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=39.63 MB,
> free=4.65 GB, max=4.68 GB, blocks=35, accesses=376070, hits=12035,
> hitRatio=3.20%%, cachingAccesses=12070, cachingHits=12035,
> cachingHitsRatio=99.71%%, evictions=0, evicted=0, evictedPerRun=NaN
> 2011-03-08 16:47:28,863 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> NotServingRegionException; Region is not online:
> usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
> 2011-03-08 16:47:28,865 ERROR
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> org.apache.hadoop.hbase.UnknownScannerException: Name: -1
>   at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1795)
>   at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>
>
> Masterserver:
> .
> .
> . (same thing)
> 2011-03-08 16:51:34,679 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Region has been
> PENDING_CLOSE for too long, running forced unassign again on
> region=usertable,user1948102037,1299592536693.d5bae6bbe54aa182e1215ab626e0011e.
>
>
> --
> deniz
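For context on the question above: in the shell, truncate is essentially
disable + drop + recreate, so a Ctrl-C in the middle of the disable can
leave the table stuck half closed, which would fit the PENDING_CLOSE
messages in the master log. Below is a minimal sketch of checking the
table state and redoing the truncate from the 0.90-era Java client; only
the table name 'usertable' comes from this thread, the rest is
illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RedoTruncate {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        String table = "usertable";

        // If the Ctrl-C landed mid-disable, finish the disable first;
        // disableTable blocks until the regions are offline.
        if (!admin.isTableDisabled(table)) {
          admin.disableTable(table);
        }

        // truncate = disable + drop + recreate with the same schema.
        HTableDescriptor desc = admin.getTableDescriptor(Bytes.toBytes(table));
        admin.deleteTable(table);
        admin.createTable(desc);
      }
    }

Whether this unsticks a region the master is already force-unassigning is
a separate question, but it at least finishes what the interrupted shell
command started.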
>
> On Tue, Mar 8, 2011 at 4:34 PM, M.Deniz OKTAR wrote:
>
>> Hi all,
>>
>> Thanks for the support. I've been trying to replicate the problem
>> since this morning. Before that, I played with the configuration: I
>> used to have only one user, with all the permissions set accordingly;
>> now I've followed the Cloudera manuals and set the permissions for
>> the hdfs and mapred users (and changed hbase-env.sh).
>>
>> I ran two trials. In both, the Yahoo test failed after receiving lots
>> of "0"s, but the region servers didn't die. At some points in the
>> test (including when it failed), the HBase master threw exceptions
>> about not being able to reach one of the servers. I also lost the ssh
>> connection to that server, but after a while it recovered (and so did
>> the HMaster). The last thing in the regionserver logs was that it was
>> going for a flush.
>>
>> I'll run the tests again and provide you with clean log files from
>> all servers (hadoop, hbase, namenode, masternode logs).
>>
>> If you have any suggestions or directions to help me diagnose the
>> problem better, that would be lovely.
>>
>> btw: these servers do not have ECC memory, but I do not see any
>> corruption in the data.
>>
>> Thanks!
>>
>> --
>> deniz
>>
>>
>> On Mon, Mar 7, 2011 at 7:47 PM, Jean-Daniel Cryans wrote:
>>
>>> Along with a bigger portion of the log, it might be good to check
>>> whether there's anything in the .out file that looks like a jvm
>>> error.
>>>
>>> J-D
>>>
>>> On Mon, Mar 7, 2011 at 9:22 AM, M.Deniz OKTAR wrote:
>>> > I ran every kind of benchmark I could find on those machines and
>>> > they seemed to work fine. Did memory/disk tests too.
>>> >
>>> > The master node and the other nodes log some information and
>>> > exceptions about not being able to reach the dead node.
>>> >
>>> > Btw, sometimes the process does not die but loses the connection.
>>> >
>>> > --
>>> > deniz
>>> >
>>> > On Mon, Mar 7, 2011 at 7:19 PM, Stack wrote:
>>> >
>>> >> I'm stumped. I have nothing to go on when there are no death
>>> >> throes or complaints. This hardware for sure is healthy? Other
>>> >> stuff runs w/o issue?
>>> >> St.Ack
>>> >>
>>> >> On Mon, Mar 7, 2011 at 8:48 AM, M.Deniz OKTAR wrote:
>>> >> > I don't know if it's normal, but I see a lot of '0's in the
>>> >> > test results when it tends to fail, such as:
>>> >> >
>>> >> > 1196 sec: 7394901 operations; 0 current ops/sec;
>>> >> >
>>> >> > --
>>> >> > deniz
>>> >> >
>>> >> > On Mon, Mar 7, 2011 at 6:46 PM, M.Deniz OKTAR wrote:
>>> >> >
>>> >> >> Hi,
>>> >> >>
>>> >> >> Thanks for the effort, answers below:
>>> >> >>
>>> >> >> On Mon, Mar 7, 2011 at 6:08 PM, Stack wrote:
>>> >> >>
>>> >> >>> On Mon, Mar 7, 2011 at 5:43 AM, M.Deniz OKTAR
>>> >> >>> <deniz.oktar@gmail.com> wrote:
>>> >> >>> > We have a 5-node cluster, 4 of them being region servers.
>>> >> >>> > I am running a custom workload with YCSB, and while the
>>> >> >>> > data is loading (heavy insert) at least one of the region
>>> >> >>> > servers dies after about 600000 operations.
>>> >> >>>
>>> >> >>> Tell us the character of your 'custom workload' please.
>>> >> >>>
>>> >> >> The workload is below; the part that fails is the loading
>>> >> >> part (-load), which inserts all the records first.
>>> >> >>
>>> >> >> recordcount=10000000
>>> >> >> operationcount=3000000
>>> >> >> workload=com.yahoo.ycsb.workloads.CoreWorkload
>>> >> >>
>>> >> >> readallfields=true
>>> >> >>
>>> >> >> readproportion=0.5
>>> >> >> updateproportion=0.1
>>> >> >> scanproportion=0
>>> >> >> insertproportion=0.35
>>> >> >> readmodifywriteproportion=0.05
>>> >> >>
>>> >> >> requestdistribution=zipfian
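For reference, with those settings the -load phase writes
recordcount=10,000,000 rows into 'usertable', each with several small
fields, which is what makes it such a heavy sequential-insert test; the
zipfian distribution only matters in the run phase. A minimal,
illustrative Java approximation against the 0.90-era client API; the
column family 'family', the field0..field9 layout, and the 100-byte
field size are YCSB defaults assumed here, not something stated in this
thread:

    import java.util.Random;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class LoadLikeYcsb {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "usertable");
        table.setAutoFlush(false);              // buffer puts client-side
        byte[] family = Bytes.toBytes("family");
        Random rnd = new Random();
        byte[] value = new byte[100];

        for (long i = 0; i < 10000000L; i++) {  // recordcount=10000000
          Put put = new Put(Bytes.toBytes("user" + i));
          for (int f = 0; f < 10; f++) {        // ten fields per record
            rnd.nextBytes(value);
            put.add(family, Bytes.toBytes("field" + f), value);
          }
          table.put(put);                       // goes to the write buffer
        }
        table.flushCommits();                   // push remaining buffered puts
        table.close();
      }
    }

Even this simplified loop keeps every region server flushing and
compacting continuously, which is why the flush lines dominate the end
of the logs below.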
>>> >> >>
>>> >> >>> > There are no abnormalities in the logs as far as I can
>>> >> >>> > see; the only common point is that all of them (in
>>> >> >>> > different trials, different region servers fail) request a
>>> >> >>> > flush in their last log lines, given below. The .out files
>>> >> >>> > are empty. I am looking at the /var/log/hbase folder for
>>> >> >>> > logs. Running the latest Sun Java 6. I couldn't find any
>>> >> >>> > logs that indicate a problem with Java. Tried the tests
>>> >> >>> > with OpenJDK and had the same results.
>>> >> >>> >
>>> >> >>> It's strange that flush is the last thing in your log. The
>>> >> >>> process is dead? We are exiting w/o a note in the logs?
>>> >> >>> That's unusual. We usually scream loudly when dying.
>>> >> >>
>>> >> >> Yes, that's the strange part. The last line is a flush, as if
>>> >> >> the process never failed. Yes, the process is dead and HBase
>>> >> >> cannot see the node.
>>> >> >>
>>> >> >>> > I have set ulimits (50000) and xceivers (20000) for
>>> >> >>> > multiple users and am certain that they are correct.
>>> >> >>>
>>> >> >>> The first line in an hbase log prints out the ulimit it
>>> >> >>> sees. You might check that the hbase process for sure is
>>> >> >>> picking up your ulimit setting.
>>> >> >>
>>> >> >> That was a mistake I made a couple of days ago; I checked it
>>> >> >> with cat /proc/<pid>/limits, and all related users like
>>> >> >> 'hbase' have those limits. Checked the logs:
>>> >> >>
>>> >> >> Mon Mar 7 06:41:15 EET 2011 Starting regionserver on test-1
>>> >> >> ulimit -n 52768
>>> >> >>
>>> >> >>> > Also in the kernel logs, there are no apparent problems.
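The startup line is the authoritative check, but the limits of an
already-running regionserver can also be read straight from procfs,
which is what the cat /proc/<pid>/limits check above does. A small
Linux-only Java sketch of the same check (class name illustrative; pass
the regionserver's pid to inspect a live process rather than this JVM):

    import java.io.BufferedReader;
    import java.io.FileReader;

    public class ShowOpenFileLimit {
      public static void main(String[] args) throws Exception {
        // Default to this JVM's own limits; pass a pid for another process.
        String path = args.length > 0 ? "/proc/" + args[0] + "/limits"
                                      : "/proc/self/limits";
        BufferedReader in = new BufferedReader(new FileReader(path));
        String line;
        while ((line = in.readLine()) != null) {
          // e.g. "Max open files    52768    52768    files"
          if (line.startsWith("Max open files")) {
            System.out.println(line);
          }
        }
        in.close();
      }
    }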
>>> >> >>> > >>> >> >>> >>> >> >>> (The mystery compounds) >>> >> >>> >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG >>> >> >>> > org.apache.hadoop.hbase.regionserver.CompactSplitThread: >>> Compaction >>> >> >>> > requested for >>> >> >>> > >>> >> >>> usertable,user1030079237,1299502934627.257739740f58da96d5c5ef51a7d3efc3= . >>> >> >>> > because regionserver60020.cacheFlusher; priority=3D3, compacti= on >>> queue >>> >> >>> size=3D18 >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG >>> >> >>> org.apache.hadoop.hbase.regionserver.HRegion: >>> >> >>> > NOT flushing memstore for region >>> >> >>> > >>> >> >>> >>> >> >>> usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc= ., >>> >> >>> > flushing=3Dfalse, writesEnabled=3Dfalse >>> >> >>> > 2011-03-07 15:07:58,301 DEBUG >>> >> >>> org.apache.hadoop.hbase.regionserver.HRegion: >>> >> >>> > Started memstore flush for >>> >> >>> > >>> >> >>> >>> >> >>> usertable,user1662209069,1299502135191.9fa929e6fb439843cffb604dea3f88f6= ., >>> >> >>> > current region memstore size 68.6m >>> >> >>> > 2011-03-07 15:07:58,310 DEBUG >>> >> >>> org.apache.hadoop.hbase.regionserver.HRegion: >>> >> >>> > Flush requested on >>> >> >>> > >>> >> >>> usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc= . >>> >> >>> > -end of log file- >>> >> >>> > --- >>> >> >>> > >>> >> >>> >>> >> >>> Nothing more? >>> >> >>> >>> >> >>> >>> >> >> No, nothing after that. But quite a lot of logs before that, I ca= n >>> send >>> >> >> them if you'd like. >>> >> >> >>> >> >> >>> >> >> >>> >> >>> Thanks, >>> >> >>> St.Ack >>> >> >>> >>> >> >> >>> >> >> Thanks alot! >>> >> >> >>> >> >> >>> >> > >>> >> >>> > >>> >> >> > --=20 =E4=BB=8E=E6=88=91=E7=9A=84=E7=A7=BB=E5=8A=A8=E8=AE=BE=E5=A4=87=E5=8F=91=E9= =80=81 Thanks & Best regards jiajun