hbase-user mailing list archives

From Samir Ahmic <ahmic.sa...@gmail.com>
Subject Re: Phantom region server and PENDING_OPEN regions
Date Tue, 24 Nov 2015 12:21:40 GMT
Your hosts file looks fine. If I understand correctly, the value of the
$HOSTNAME env variable is *.node.dc1.consul? Try changing the servers'
hostnames to *.service.consul.
Also try disabling resolution by the DNS server: comment out all lines in
/etc/resolv.conf.
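
To see whether forward and reverse lookups agree on a host, something
like the following rough Python sketch could be run on each machine. The
helper name check_resolution and its injectable forward/reverse arguments
are my own for illustration, not anything from HBase or Ambari; it only
uses the standard socket module.

```python
import socket


def check_resolution(hostname, forward=socket.gethostbyname,
                     reverse=socket.gethostbyaddr):
    """Resolve hostname to an IP, then resolve that IP back to a name.

    Returns (ip, reverse_name, matches). The forward/reverse parameters
    are hypothetical hooks so the check can run without a real resolver.
    """
    ip = forward(hostname)
    # gethostbyaddr returns (primary_name, alias_list, ip_list)
    reverse_name = reverse(ip)[0]
    # DNS names are case-insensitive, so compare in lower case
    matches = reverse_name.lower() == hostname.lower()
    return ip, reverse_name, matches


if __name__ == "__main__":
    name = socket.getfqdn()
    try:
        ip, back, ok = check_resolution(name)
    except OSError as exc:  # covers socket.gaierror and socket.herror
        print(f"lookup failed for {name}: {exc}")
    else:
        status = "consistent" if ok else "MISMATCH"
        print(f"{name} -> {ip} -> {back} ({status})")
```

If the reverse name comes back as *.node.dc1.consul while the forward
name is *.service.consul, that matches what your ping output shows.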

Regards
Samir

On Tue, Nov 24, 2015 at 12:29 PM, Kristoffer Sjögren <stoffe@gmail.com>
wrote:

> Only one network interface on all machines. The ping is interesting:
> each machine answers as *.node.dc1.consul when pinged from the other
> machine, but as *.service.consul when pinged locally.
>
> amb1.service.consul /etc/hosts
> 172.17.0.89 amb1.service.consul amb1
> 127.0.0.1 localhost
> ::1 localhost ip6-localhost ip6-loopback
> fe00::0 ip6-localnet
> ff00::0 ip6-mcastprefix
> ff02::1 ip6-allnodes
> ff02::2 ip6-allrouters
>
> amb2.service.consul /etc/hosts
> 172.17.0.90 amb2.service.consul amb2
> 127.0.0.1 localhost
> ::1 localhost ip6-localhost ip6-loopback
> fe00::0 ip6-localnet
> ff00::0 ip6-mcastprefix
> ff02::1 ip6-allnodes
> ff02::2 ip6-allrouters
>
>
> ping amb1 from amb1.service.consul
>
> PING amb1.service.consul (172.17.0.89) 56(84) bytes of data.
> 64 bytes from amb1.service.consul (172.17.0.89): icmp_seq=1 ttl=64
> time=0.059 ms
>
> ping amb2 from amb1.service.consul
>
> PING amb2.service.consul (172.17.0.90) 56(84) bytes of data.
> 64 bytes from amb2.node.dc1.consul (172.17.0.90): icmp_seq=1 ttl=64
> time=0.069 ms
>
> ping amb1 from amb2.service.consul
>
> PING amb1.service.consul (172.17.0.89) 56(84) bytes of data.
> 64 bytes from amb1.node.dc1.consul (172.17.0.89): icmp_seq=1 ttl=64
> time=0.070 ms
>
> ping amb2 from amb2.service.consul
>
> PING amb2.service.consul (172.17.0.90) 56(84) bytes of data.
> 64 bytes from amb2.service.consul (172.17.0.90): icmp_seq=1 ttl=64
> time=0.054 ms
>
> On Tue, Nov 24, 2015 at 11:58 AM, Samir Ahmic <ahmic.samir@gmail.com>
> wrote:
> > As far as I can see from the logs, you also have an issue connecting to
> > ZK. The configuration points to the correct server, but name resolution
> > produces wrong values. Do you have multiple network interfaces on the
> > servers? What does ping $HOSTNAME return? What do you have in the
> > /etc/hosts file? Do you have a local nameserver running on the servers?
> >
> > Regards
> > Samir
> > On Nov 24, 2015 11:21 AM, "Kristoffer Sjögren" <stoffe@gmail.com> wrote:
> >
> >> The logs on the region server [1] are also quite interesting.
> >>
> >> Before I restarted the cluster, the region server complained that
> >> amb2.node.dc1.consul hijacked the regions from
> >> amb2.service.consul.
> >>
> >> 2015-11-24 08:26:45,099 WARN  [RS_OPEN_META-amb2:16020-0]
> >> zookeeper.ZKAssign: regionserver:16020-0x1513899be420000,
> >> quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
> >> transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
> >> to RS_ZK_REGION_OPENING failed, the server that tried to transition
> >> was amb2.node.dc1.consul,16020,1448353564099 not the expected
> >> amb2.service.consul,16020,1448353564099
> >> 2015-11-24 08:26:45,099 WARN  [RS_OPEN_META-amb2:16020-0]
> >> coordination.ZkOpenRegionCoordination: Failed transition from OFFLINE
> >> to OPENING for region=1588230740
> >> 2015-11-24 08:26:45,099 WARN  [RS_OPEN_META-amb2:16020-0]
> >> handler.OpenRegionHandler: Region was hijacked? Opening cancelled for
> >> encodedName=1588230740
> >> 2015-11-24 08:26:45,100 INFO  [RS_OPEN_META-amb2:16020-0]
> >> coordination.ZkOpenRegionCoordination: Opening of region {ENCODED =>
> >> 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''}
> >> failed, transitioning from OFFLINE to FAILED_OPEN in ZK, expecting
> >> version 0
> >> 2015-11-24 08:26:45,101 WARN  [RS_OPEN_META-amb2:16020-0]
> >> zookeeper.ZKAssign: regionserver:16020-0x1513899be420000,
> >> quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
> >> transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
> >> to RS_ZK_REGION_FAILED_OPEN failed, the server that tried to
> >> transition was amb2.node.dc1.consul,16020,1448353564099 not the
> >> expected amb2.service.consul,16020,1448353564099
> >>
> >>
> >> After editing resolv.conf and restarting the cluster, it still complains
> >> about amb2.node.dc1.consul trying to transition the regions instead of
> >> amb2.service.consul.
> >>
> >> 2015-11-24 09:32:26,334 WARN  [RS_OPEN_META-amb2:16020-0]
> >> zookeeper.ZKAssign: regionserver:16020-0x1513899be42000d,
> >> quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
> >> transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
> >> to RS_ZK_REGION_OPENING failed, the server that tried to transition
> >> was amb2.node.dc1.consul,16020,1448357534179 not the expected
> >> amb2.service.consul,16020,1448357534179
> >> 2015-11-24 09:32:26,335 WARN  [RS_OPEN_META-amb2:16020-0]
> >> coordination.ZkOpenRegionCoordination: Failed transition from OFFLINE
> >> to OPENING for region=1588230740
> >> 2015-11-24 09:32:26,335 WARN  [RS_OPEN_META-amb2:16020-0]
> >> handler.OpenRegionHandler: Region was hijacked? Opening cancelled for
> >> encodedName=1588230740
> >> 2015-11-24 09:32:26,335 INFO  [RS_OPEN_META-amb2:16020-0]
> >> coordination.ZkOpenRegionCoordination: Opening of region {ENCODED =>
> >> 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''}
> >> failed, transitioning from OFFLINE to FAILED_OPEN in ZK, expecting
> >> version 2
> >> 2015-11-24 09:32:26,336 WARN  [RS_OPEN_META-amb2:16020-0]
> >> zookeeper.ZKAssign: regionserver:16020-0x1513899be42000d,
> >> quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
> >> transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
> >> to RS_ZK_REGION_FAILED_OPEN failed, the server that tried to
> >> transition was amb2.node.dc1.consul,16020,1448357534179 not the
> >> expected amb2.service.consul,16020,1448357534179
> >>
> >>
> >> [1] http://pastebin.com/z93p8Mdu
> >>
> >> On Tue, Nov 24, 2015 at 10:48 AM, Kristoffer Sjögren <stoffe@gmail.com>
> >> wrote:
> >> > I removed node.dc1.consul from resolv.conf and restarted the
> >> > cluster, but it still shows up on the master UI.
> >> >
> >> > amb2.node.dc1.consul,16020,1448353564099  Tue Nov 24 08:26:04 UTC 2015  0  0
> >> > amb2.service.consul,16020,1448353564099  Tue Nov 24 08:26:04 UTC 2015  0  0
> >> >
> >> > The logs report [1] that the meta region fails to assign to
> >> > node.dc1.consul and then tries to assign it to amb2.service.consul and
> >> > gets stuck in PENDING_OPEN again.
> >> >
> >> > ---
> >> > 1588230740  hbase:meta,,1.1588230740 state=PENDING_OPEN, ts=Tue Nov 24
> >> > 09:32:26 UTC 2015 (450s ago),
> >> > server=amb2.service.consul,16020,1448357534179  450511
> >> > ---
> >> >
> >> > Before I restarted the cluster, the master log [2] complained about
> >> > not being able to connect to amb2.node.dc1.consul/172.17.0.85:16020.
> >> >
> >> > I'm not sure, but somehow it feels as if amb2.node.dc1.consul shadows
> >> > the real host amb2.service.consul.
> >> >
> >> > I was looking into the source code and found the configuration
> >> > 'hbase.regionserver.hostname' - could that be of help here to remove
> >> > the node.dc1 host?
> >> >
> >> > [1] http://pastebin.com/uZKqK9BJ
> >> > [2] http://pastebin.com/s10E2rtA
> >> >
> >> > On Tue, Nov 24, 2015 at 10:23 AM, Samir Ahmic <ahmic.samir@gmail.com>
> >> > wrote:
> >> >> Hi Kristoffer,
> >> >> It looks like you have some issue with name resolution. Try to remove
> >> >> the incorrect value (node.dc1.consul) from resolv.conf and then
> >> >> restart the HBase cluster.
> >> >> Regarding the issue with the region in transition, check the master
> >> >> log for "hbase:meta,,1.1588230740". There should be an exception
> >> >> explaining why hbase:meta cannot be transitioned from PENDING_OPEN to
> >> >> OPEN state. If the hbase:meta table is unavailable, the master
> >> >> cannot finish initialization.
> >> >>
> >> >> Regards
> >> >> Samir
> >> >>
> >> >> On Tue, Nov 24, 2015 at 10:11 AM, Kristoffer Sjögren <stoffe@gmail.com>
> >> >> wrote:
> >> >>
> >> >>> Sorry, I should mention that this is HBase 1.1.2.
> >> >>>
> >> >>> ZooKeeper only reports one region server.
> >> >>>
> >> >>> $ ls /hbase-unsecure/rs
> >> >>> [amb2.service.consul,16020,1448353564099]
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Tue, Nov 24, 2015 at 9:55 AM, Kristoffer Sjögren <stoffe@gmail.com>
> >> >>> wrote:
> >> >>> > Hi
> >> >>> >
> >> >>> > I'm trying to install an HBase cluster with 1 master
> >> >>> > (amb1.service.consul) and 1 region server (amb2.service.consul)
> >> >>> > using Ambari on Docker containers provided by sequenceiq [1] with
> >> >>> > a custom blueprint [2].
> >> >>> >
> >> >>> > Every component installs correctly except for HBase, which gets
> >> >>> > stuck with regions in transition:
> >> >>> >
> >> >>> > ---
> >> >>> > hbase:meta,,1.1588230740 state=PENDING_OPEN, ts=Tue Nov 24 08:26:45
> >> >>> > UTC 2015 (1098s ago), server=amb2.service.consul,16020,1448353564099
> >> >>> > ---
> >> >>> >
> >> >>> > And for some reason 2 region servers (instead of 1) are
> >> >>> > discovered by the master with the exact same timestamp but with
> >> >>> > different hostnames.
> >> >>> > I'm not sure if this is the reason why the regions get stuck.
> >> >>> >
> >> >>> > ----
> >> >>> > amb2.node.dc1.consul,16020,1448353564099  Tue Nov 24 08:26:04 UTC 2015  0  0
> >> >>> > amb2.service.consul,16020,1448353564099  Tue Nov 24 08:26:04 UTC 2015  0  0
> >> >>> > ----
> >> >>> >
> >> >>> > The only place I can find "amb2.node.dc1.consul" on the ambari
> >> >>> > agent/server hosts is in /etc/resolv.conf, which looks like this.
> >> >>> >
> >> >>> > ----
> >> >>> > nameserver 172.17.0.82
> >> >>> > search service.consul node.dc1.consul
> >> >>> > ----
> >> >>> >
> >> >>> > Is there some way that I can manually tell the master to disregard
> >> >>> > the "phantom" host amb2.node.dc1.consul?
> >> >>> >
> >> >>> > Any help or tips appreciated.
> >> >>> >
> >> >>> > Cheers,
> >> >>> > -Kristoffer
> >> >>> >
> >> >>> >
> >> >>> > [1] https://github.com/sequenceiq/docker-ambari
> >> >>> > [2] https://gist.githubusercontent.com/krisskross/901ed8223c1ed1db80e3/raw/869327be9ad15e6a9f099a7591323244cd245357/ambari-hdp2.3
