hbase-user mailing list archives

From Kristoffer Sjögren <sto...@gmail.com>
Subject Re: Phantom region server and PENDING_OPEN regions
Date Tue, 24 Nov 2015 10:21:21 GMT
The logs on the region server [1] are also quite interesting.

Before I restarted the cluster, the region server complained that
amb2.node.dc1.consul had hijacked the regions from
amb2.service.consul.

2015-11-24 08:26:45,099 WARN  [RS_OPEN_META-amb2:16020-0]
zookeeper.ZKAssign: regionserver:16020-0x1513899be420000,
quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
to RS_ZK_REGION_OPENING failed, the server that tried to transition
was amb2.node.dc1.consul,16020,1448353564099 not the expected
amb2.service.consul,16020,1448353564099
2015-11-24 08:26:45,099 WARN  [RS_OPEN_META-amb2:16020-0]
coordination.ZkOpenRegionCoordination: Failed transition from OFFLINE
to OPENING for region=1588230740
2015-11-24 08:26:45,099 WARN  [RS_OPEN_META-amb2:16020-0]
handler.OpenRegionHandler: Region was hijacked? Opening cancelled for
encodedName=1588230740
2015-11-24 08:26:45,100 INFO  [RS_OPEN_META-amb2:16020-0]
coordination.ZkOpenRegionCoordination: Opening of region {ENCODED =>
1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''}
failed, transitioning from OFFLINE to FAILED_OPEN in ZK, expecting
version 0
2015-11-24 08:26:45,101 WARN  [RS_OPEN_META-amb2:16020-0]
zookeeper.ZKAssign: regionserver:16020-0x1513899be420000,
quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
to RS_ZK_REGION_FAILED_OPEN failed, the server that tried to
transition was amb2.node.dc1.consul,16020,1448353564099 not the
expected amb2.service.consul,16020,1448353564099
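
The two server names in these warnings differ only in the hostname; the
port and the start code are identical. A minimal Python sketch for pulling
both names out of a ZKAssign warning and comparing them is shown here. It
assumes the "host,port,startcode" layout visible in the log lines above;
ServerName, find_mismatch and same_process_different_host are hypothetical
helper names for illustration, not part of any HBase API.

```python
import re
from typing import NamedTuple, Optional, Tuple


class ServerName(NamedTuple):
    """Illustrative holder for the 'host,port,startcode' triple seen in the log."""
    host: str
    port: int
    startcode: int


# Matches the tail of the ZKAssign warning once line wrapping is collapsed.
MISMATCH = re.compile(
    r"the server that tried to transition was (\S+) not the expected (\S+)"
)


def parse_server_name(text: str) -> ServerName:
    """Split 'host,port,startcode' into its three parts."""
    host, port, startcode = text.split(",")
    return ServerName(host, int(port), int(startcode))


def find_mismatch(log_text: str) -> Optional[Tuple[ServerName, ServerName]]:
    """Return (actual, expected) from the first mismatch warning, or None."""
    # The mail client wrapped the log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    match = MISMATCH.search(flat)
    if match is None:
        return None
    return parse_server_name(match.group(1)), parse_server_name(match.group(2))


def same_process_different_host(actual: ServerName, expected: ServerName) -> bool:
    """True when port and start code agree but the hostname does not."""
    return (
        actual.port == expected.port
        and actual.startcode == expected.startcode
        and actual.host != expected.host
    )


SAMPLE = """
2015-11-24 08:26:45,099 WARN  [RS_OPEN_META-amb2:16020-0]
zookeeper.ZKAssign: regionserver:16020-0x1513899be420000,
quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
to RS_ZK_REGION_OPENING failed, the server that tried to transition
was amb2.node.dc1.consul,16020,1448353564099 not the expected
amb2.service.consul,16020,1448353564099
"""

actual, expected = find_mismatch(SAMPLE)
print(actual.host, expected.host, same_process_different_host(actual, expected))
```

A result of True from same_process_different_host suggests a single region
server process known under two hostnames rather than two separate servers.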


After editing resolv.conf and restarting the cluster, it still complains
about amb2.node.dc1.consul trying to transition the regions instead of
amb2.service.consul.

2015-11-24 09:32:26,334 WARN  [RS_OPEN_META-amb2:16020-0]
zookeeper.ZKAssign: regionserver:16020-0x1513899be42000d,
quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
to RS_ZK_REGION_OPENING failed, the server that tried to transition
was amb2.node.dc1.consul,16020,1448357534179 not the expected
amb2.service.consul,16020,1448357534179
2015-11-24 09:32:26,335 WARN  [RS_OPEN_META-amb2:16020-0]
coordination.ZkOpenRegionCoordination: Failed transition from OFFLINE
to OPENING for region=1588230740
2015-11-24 09:32:26,335 WARN  [RS_OPEN_META-amb2:16020-0]
handler.OpenRegionHandler: Region was hijacked? Opening cancelled for
encodedName=1588230740
2015-11-24 09:32:26,335 INFO  [RS_OPEN_META-amb2:16020-0]
coordination.ZkOpenRegionCoordination: Opening of region {ENCODED =>
1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''}
failed, transitioning from OFFLINE to FAILED_OPEN in ZK, expecting
version 2
2015-11-24 09:32:26,336 WARN  [RS_OPEN_META-amb2:16020-0]
zookeeper.ZKAssign: regionserver:16020-0x1513899be42000d,
quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
to RS_ZK_REGION_FAILED_OPEN failed, the server that tried to
transition was amb2.node.dc1.consul,16020,1448357534179 not the
expected amb2.service.consul,16020,1448357534179
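
Both hostnames in these warnings end in a suffix from the "search" line of
the resolv.conf quoted further down in this thread. As a rough illustration
of how a search list turns one short name into several fully qualified
candidates, here is a minimal Python sketch. It only lists the candidates;
the real resolver applies further rules (such as ndots) that are not
modelled here, and whether this expansion is what produces the second
hostname on the master is an assumption, not something the logs confirm.
search_domains and candidate_names are hypothetical helper names.

```python
from typing import List


def search_domains(resolv_conf: str) -> List[str]:
    """Return the domains from the last 'search' line of a resolv.conf text."""
    domains: List[str] = []
    for line in resolv_conf.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines.
        if not stripped or stripped[0] in "#;":
            continue
        fields = stripped.split()
        if fields[0] == "search":
            # When several search lines exist, the last one is used.
            domains = fields[1:]
    return domains


def candidate_names(short_name: str, resolv_conf: str) -> List[str]:
    """Append each search domain to a short host name, in list order."""
    return [f"{short_name}.{domain}" for domain in search_domains(resolv_conf)]


# Contents of /etc/resolv.conf as quoted in the original post.
RESOLV_CONF = """\
nameserver 172.17.0.82
search service.consul node.dc1.consul
"""

for name in candidate_names("amb2", RESOLV_CONF):
    print(name)
```

With the search line reduced to "search service.consul", the same call
yields only the service.consul name.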


[1] http://pastebin.com/z93p8Mdu

On Tue, Nov 24, 2015 at 10:48 AM, Kristoffer Sjögren <stoffe@gmail.com> wrote:
> I removed node.dc1.consul from resolv.conf and restarted the
> cluster, but it still shows up on the master UI.
>
> amb2.node.dc1.consul,16020,1448353564099  Tue Nov 24 08:26:04 UTC 2015  0  0
> amb2.service.consul,16020,1448353564099   Tue Nov 24 08:26:04 UTC 2015  0  0
>
> The logs report [1] that the meta region fails to assign to
> node.dc1.consul and then tries to assign it to amb2.service.consul and
> gets stuck in PENDING_OPEN again.
>
> ---
> 1588230740  hbase:meta,,1.1588230740 state=PENDING_OPEN, ts=Tue Nov 24
> 09:32:26 UTC 2015 (450s ago),
> server=amb2.service.consul,16020,1448357534179  450511
> ---
>
> Before I restarted the cluster, the master log [2] complained about
> not being able to connect to amb2.node.dc1.consul/172.17.0.85:16020.
>
> I'm not sure, but somehow it feels as if amb2.node.dc1.consul shadows
> the real host amb2.service.consul.
>
> I was looking into the source code and found the configuration
> 'hbase.regionserver.hostname' - could that be of help here to remove
> the node.dc1 host?
>
> [1] http://pastebin.com/uZKqK9BJ
> [2] http://pastebin.com/s10E2rtA
>
> On Tue, Nov 24, 2015 at 10:23 AM, Samir Ahmic <ahmic.samir@gmail.com> wrote:
>> Hi Kristoffer,
>> It looks like you have an issue with name resolution. Try removing the
>> incorrect value (node.dc1.consul) from resolv.conf and then restart the
>> HBase cluster.
>> Regarding the region in transition, check the master log for
>> "hbase:meta,,1.1588230740".
>> There should be an exception explaining why hbase:meta cannot transition
>> from PENDING_OPEN to OPEN. If the hbase:meta table is unavailable, the
>> master cannot finish initialization.
>>
>> Regards
>> Samir
>>
>> On Tue, Nov 24, 2015 at 10:11 AM, Kristoffer Sjögren <stoffe@gmail.com>
>> wrote:
>>
>>> Sorry, I should mention that this is HBase 1.1.2.
>>>
>>> ZooKeeper only reports one region server.
>>>
>>> $ ls /hbase-unsecure/rs
>>> [amb2.service.consul,16020,1448353564099]
>>>
>>>
>>>
>>>
>>> On Tue, Nov 24, 2015 at 9:55 AM, Kristoffer Sjögren <stoffe@gmail.com>
>>> wrote:
>>> > Hi
>>> >
>>> > I'm trying to install an HBase cluster with 1 master
>>> > (amb1.service.consul) and 1 region server (amb2.service.consul) using
>>> > Ambari on docker containers provided by sequenceiq [1] using a custom
>>> > blueprint [2].
>>> >
>>> > Every component installs correctly except for HBase, which gets stuck
>>> > with regions in transition:
>>> >
>>> > ---
>>> > hbase:meta,,1.1588230740 state=PENDING_OPEN, ts=Tue Nov 24 08:26:45
>>> > UTC 2015 (1098s ago), server=amb2.service.consul,16020,1448353564099
>>> > ---
>>> >
>>> > And for some reason 2 region servers (instead of 1) are discovered by
>>> > the master with the exact same timestamp but with different hostnames.
>>> > I'm not sure if this is the reason why the regions get stuck.
>>> >
>>> > ----
>>> > amb2.node.dc1.consul,16020,1448353564099  Tue Nov 24 08:26:04 UTC 2015  0  0
>>> > amb2.service.consul,16020,1448353564099   Tue Nov 24 08:26:04 UTC 2015  0  0
>>> > ----
>>> >
>>> > The only place I can find "amb2.node.dc1.consul" on the ambari
>>> > agent/server hosts is in /etc/resolv.conf which looks like this.
>>> >
>>> > ----
>>> > nameserver 172.17.0.82
>>> > search service.consul node.dc1.consul
>>> > ----
>>> >
>>> > Is there some way that I can manually tell the master to disregard the
>>> > "phantom" host amb2.node.dc1.consul?
>>> >
>>> > Any help or tips appreciated.
>>> >
>>> > Cheers,
>>> > -Kristoffer
>>> >
>>> >
>>> > [1] https://github.com/sequenceiq/docker-ambari
>>> > [2]
>>> https://gist.githubusercontent.com/krisskross/901ed8223c1ed1db80e3/raw/869327be9ad15e6a9f099a7591323244cd245357/ambari-hdp2.3
>>>
