hbase-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: scanner lease expired/region server shutdown
Date Tue, 26 Jan 2010 03:23:49 GMT
I've seen this before, this is a GC pause that took more than a
minute. It was enough to time out the scanner leases and zookeeper
session.

J-D

On Mon, Jan 25, 2010 at 7:14 PM, Kannan Muthukkaruppan
<Kannan@facebook.com> wrote:
> I was doing some brute force testing - running one instance of PerformanceEvaluation
(PE) for writes, and another instance for randomReads.
>
> One of the region servers went down after a while. [This is on 0.20.2]. The region server
log had things like:
>
> 2010-01-25 11:33:39,416 WARN org.apache.hadoop.hbase.regionserver.HLog: IPC Server handler
34 on 60020 took 65190ms appending an edit to hlog; editcount=27878
> 2010-01-25 11:33:39,416 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: Cache Stats:
Sizes: Total=1815.4441MB (1903631104), Free=581.23096MB (609464832),\
>  Max=2396.675MB (2513095936), Counts: Blocks=28216, Access=10904689, Hit=9313612, Miss=1591077,
Evictions=110, Evicted=414336, Ratios: Hit Ratio=85.40924191474\
> 915%, Miss Ratio=14.590759575366974%, Evicted/Run=3766.69091796875
> 2010-01-25 11:33:39,417 WARN org.apache.hadoop.hbase.regionserver.HLog: IPC Server handler
52 on 60020 took 65195ms appending an edit to hlog; editcount=27879
> 2010-01-25 11:33:39,417 WARN org.apache.hadoop.hbase.regionserver.HLog: IPC Server handler
75 on 60020 took 65195ms appending an edit to hlog; editcount=27880
> 2010-01-25 11:33:39,417 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Scanner
-5220035164735014996 lease expired
>
> This was followed by:
> 2010-01-25 11:33:39,636 WARN org.apache.zookeeper.ClientCnxn: Exception closing session
0x12652c620520001 to sun.nio.ch.SelectionKeyImpl@788ab708<mailto:sun.nio.ch.SelectionKeyImpl@788ab708>
> java.io.IOException: TIMED OUT
>
> And also errors like:
> 2010-01-25 11:33:39,692 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer:
> org.apache.hadoop.hbase.UnknownScannerException: Name: -5220035164735014996
>
> Eventually, the region server shutdown.
> 2010-01-25 11:33:40,293 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 85
on 60020, call get([B@2eec23d2, row=0007555231, maxVersions=1, timeRange=[0,9223372036854775807),
families={(family=info, columns={data}}) from 10.131.1.186:51645: error: java.io.IOException:
Server not running, aborting
>
> Looks like for some reason the region server got overloaded, and was not able to finish
the scan (I am assuming this is the period meta/root region scan initiated by the master).
>
> Any thoughts/ideas on the above?
>
> Subsequently, I restarted the region server and just started the PE in sequentialWrite
mode. It seems to be off to a really slow start. For several regions, it is printing the following:
>
> 10/01/25 18:12:44 DEBUG client.HConnectionManager$TableServers: locateRegionInMeta attempt
0 of 10 failed; retrying after sleep of 1000 because: region offline: TestTable,0031841132,1264121653303
> 10/01/25 18:12:44 DEBUG client.HConnectionManager$TableServers: Removed .META.,,1 for
tableName=.META. from cache because of TestTable,0032000000,99999999999999
> 10/01/25 18:12:44 DEBUG client.HConnectionManager$TableServers: Cached location for .META.,,1
is 10.129.68.212:60020
> 10/01/25 18:12:45 DEBUG client.HConnectionManager$TableServers: locateRegionInMeta attempt
1 of 10 failed; retrying after sleep of 1000 because: region offline: TestTable,0031841132,1264121653303
> 10/01/25 18:12:45 DEBUG client.HConnectionManager$TableServers: Removed .META.,,1 for
tableName=.META. from cache because of TestTable,0032000000,99999999999999
> [..and so on..]
> 10/01/25 18:12:59 DEBUG client.HConnectionManager$TableServers: Cached location for .META.,,1
is 10.129.68.212:60020
> 10/01/25 18:13:07 DEBUG client.HConnectionManager$TableServers: locateRegionInMeta attempt
8 of 10 failed; retrying after sleep of 16000 because: region offline: TestTable,0031841132,1264121653303
> 10/01/25 18:13:07 DEBUG client.HConnectionManager$TableServers: Removed .META.,,1 for
tableName=.META. from cache because of TestTable,0032000000,99999999999999
>
> When I scan the '.META.', I noticed that most of the regions for which I get the above
error were in the middle of (?) a split. For example:
>
> TestTable,0031841132,126412 column=info:regioninfo, timestamp=1264470099796, value=REGION
=> {NAME => 'TestTa
>  1653303                     ble,0031841132,1264121653303', STARTKEY =>
'0031841132', ENDKEY => '0032057774',
>                             ENCODED => 644500792, OFFLINE => true,
SPLIT => true, TABLE => {{NAME => 'TestTab
>                             le', FAMILIES => [{NAME => 'info', VERSIONS
=> '3', COMPRESSION => 'NONE', TTL =>
>                              '2147483647', BLOCKSIZE => '65536', IN_MEMORY
=> 'false', BLOCKCACHE => 'true'}]
>                             }}
>  TestTable,0031841132,126412 column=info:server, timestamp=1264121655148, value=10.129.68.214:60020
>  1653303
>  TestTable,0031841132,126412 column=info:serverstartcode, timestamp=1264121655148, value=1264109117245
>  1653303
>  TestTable,0031841132,126412 column=info:splitA, timestamp=1264470099796, value=\000\n0031978284\000\000\000\0
>  1653303                     01&g\016\357\277\275\024\"TestTable,0031841132,1264449401108\000\n0031841132\000\
>                             000\000\005\tTestTable\000\000\000\000\000\002\000\000\000\aIS_ROOT\000\000\000\0
>                             05false\000\000\000\aIS_META\000\000\000\005false\000\000\000\001\a\004info\000\0
>                             00\000\a\000\000\000\vBLOOMFILTER\000\000\000\005false\000\000\000\vCOMPRESSION\0
>                             00\000\000\004NONE\000\000\000\bVERSIONS\000\000\000\0013\000\000\000\003TTL\000\
>                             000\000\n2147483647\000\000\000\tBLOCKSIZE\000\000\000\00565536\000\000\000\tIN_M
>                             EMORY\000\000\000\005false\000\000\000\nBLOCKCACHE\000\000\000\004true\357\277\27
>                             5\337\274#
>  TestTable,0031841132,126412 column=info:splitB, timestamp=1264470099796, value=\000\n0032057774\000\000\000\0
>  1653303                     01&g\016\357\277\275\024\"TestTable,0031978284,1264449401108\000\n0031978284\000\
>                             000\000\005\tTestTable\000\000\000\000\000\002\000\000\000\aIS_ROOT\000\000\000\0
>                             05false\000\000\000\aIS_META\000\000\000\005false\000\000\000\001\a\004info\000\0
>                             00\000\a\000\000\000\vBLOOMFILTER\000\000\000\005false\000\000\000\vCOMPRESSION\0
>                             00\000\000\004NONE\000\000\000\bVERSIONS\000\000\000\0013\000\000\000\003TTL\000\
>                             000\000\n2147483647\000\000\000\tBLOCKSIZE\000\000\000\00565536\000\000\000\tIN_M
>                             EMORY\000\000\000\005false\000\000\000\nBLOCKCACHE\000\000\000\004truej\357\277\2
>                             75\312\262
>
> Any thoughts/ideas on what might be going on? Appreciate any input in helping understand
what might be going on.
>
> Regards,
> Kannan
>
>
>

Mime
View raw message