Date: Tue, 24 Nov 2015 13:21:40 +0100
Subject: Re: Phantom region server and PENDING_OPEN regions
From: Samir Ahmic
To: user@hbase.apache.org

Your hosts file looks fine. If I understand correctly, the value of the
$HOSTNAME environment variable is *.node.dc1.consul? Try changing the
servers' hostnames to *.service.consul. Also try disabling resolution by
the DNS server: comment out all lines in /etc/resolv.conf.

Regards
Samir

On Tue, Nov 24, 2015 at 12:29 PM, Kristoffer Sjögren wrote:

> Only one network interface on all machines. The ping is interesting,
> both machines respond with *.node.dc1.consul but internally
> *.service.consul.
>
> amb1.service.consul /etc/hosts
> 172.17.0.89 amb1.service.consul amb1
> 127.0.0.1 localhost
> ::1 localhost ip6-localhost ip6-loopback
> fe00::0 ip6-localnet
> ff00::0 ip6-mcastprefix
> ff02::1 ip6-allnodes
> ff02::2 ip6-allrouters
>
> amb2.service.consul /etc/hosts
> 172.17.0.90 amb2.service.consul amb2
> 127.0.0.1 localhost
> ::1 localhost ip6-localhost ip6-loopback
> fe00::0 ip6-localnet
> ff00::0 ip6-mcastprefix
> ff02::1 ip6-allnodes
> ff02::2 ip6-allrouters
>
>
> ping amb1 from amb1.service.consul
>
> PING amb1.service.consul (172.17.0.89) 56(84) bytes of data.
> 64 bytes from amb1.service.consul (172.17.0.89): icmp_seq=1 ttl=64 time=0.059 ms
>
> ping amb2 from amb1.service.consul
>
> PING amb2.service.consul (172.17.0.90) 56(84) bytes of data.
> 64 bytes from amb2.node.dc1.consul (172.17.0.90): icmp_seq=1 ttl=64 time=0.069 ms
>
> ping amb1 from amb2.service.consul
>
> PING amb1.service.consul (172.17.0.89) 56(84) bytes of data.
> 64 bytes from amb1.node.dc1.consul (172.17.0.89): icmp_seq=1 ttl=64 time=0.070 ms
>
> ping amb2 from amb2.service.consul
>
> PING amb2.service.consul (172.17.0.90) 56(84) bytes of data.
> 64 bytes from amb2.service.consul (172.17.0.90): icmp_seq=1 ttl=64 time=0.054 ms
>
> On Tue, Nov 24, 2015 at 11:58 AM, Samir Ahmic wrote:
> > As I can see from the logs, you also have an issue connecting to
> > ZooKeeper. The configuration points to the correct server, but name
> > resolution produces wrong values. Do you have multiple network
> > interfaces on the servers? What does ping $HOSTNAME return? What do you
> > have in the /etc/hosts file? Do you have a local nameserver running on
> > the servers?
> >
> > Regards
> > Samir
> > On Nov 24, 2015 11:21 AM, "Kristoffer Sjögren" wrote:
> >
> >> The logs on the region server [1] are also quite interesting.
> >>
> >> Before I restarted the cluster, the region server complained that
> >> amb2.node.dc1.consul hijacked the regions from
> >> amb2.service.consul.
> >>
> >> 2015-11-24 08:26:45,099 WARN [RS_OPEN_META-amb2:16020-0]
> >> zookeeper.ZKAssign: regionserver:16020-0x1513899be420000,
> >> quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
> >> transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
> >> to RS_ZK_REGION_OPENING failed, the server that tried to transition
> >> was amb2.node.dc1.consul,16020,1448353564099 not the expected
> >> amb2.service.consul,16020,1448353564099
> >> 2015-11-24 08:26:45,099 WARN [RS_OPEN_META-amb2:16020-0]
> >> coordination.ZkOpenRegionCoordination: Failed transition from OFFLINE
> >> to OPENING for region=1588230740
> >> 2015-11-24 08:26:45,099 WARN [RS_OPEN_META-amb2:16020-0]
> >> handler.OpenRegionHandler: Region was hijacked? Opening cancelled for
> >> encodedName=1588230740
> >> 2015-11-24 08:26:45,100 INFO [RS_OPEN_META-amb2:16020-0]
> >> coordination.ZkOpenRegionCoordination: Opening of region {ENCODED =>
> >> 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''}
> >> failed, transitioning from OFFLINE to FAILED_OPEN in ZK, expecting
> >> version 0
> >> 2015-11-24 08:26:45,101 WARN [RS_OPEN_META-amb2:16020-0]
> >> zookeeper.ZKAssign: regionserver:16020-0x1513899be420000,
> >> quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
> >> transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
> >> to RS_ZK_REGION_FAILED_OPEN failed, the server that tried to
> >> transition was amb2.node.dc1.consul,16020,1448353564099 not the
> >> expected amb2.service.consul,16020,1448353564099
> >>
> >>
> >> After editing resolv.conf and restarting the cluster, it still complains
> >> about amb2.node.dc1.consul trying to transition the regions instead of
> >> amb2.service.consul.
> >>
> >> 2015-11-24 09:32:26,334 WARN [RS_OPEN_META-amb2:16020-0]
> >> zookeeper.ZKAssign: regionserver:16020-0x1513899be42000d,
> >> quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
> >> transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
> >> to RS_ZK_REGION_OPENING failed, the server that tried to transition
> >> was amb2.node.dc1.consul,16020,1448357534179 not the expected
> >> amb2.service.consul,16020,1448357534179
> >> 2015-11-24 09:32:26,335 WARN [RS_OPEN_META-amb2:16020-0]
> >> coordination.ZkOpenRegionCoordination: Failed transition from OFFLINE
> >> to OPENING for region=1588230740
> >> 2015-11-24 09:32:26,335 WARN [RS_OPEN_META-amb2:16020-0]
> >> handler.OpenRegionHandler: Region was hijacked? Opening cancelled for
> >> encodedName=1588230740
> >> 2015-11-24 09:32:26,335 INFO [RS_OPEN_META-amb2:16020-0]
> >> coordination.ZkOpenRegionCoordination: Opening of region {ENCODED =>
> >> 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''}
> >> failed, transitioning from OFFLINE to FAILED_OPEN in ZK, expecting
> >> version 2
> >> 2015-11-24 09:32:26,336 WARN [RS_OPEN_META-amb2:16020-0]
> >> zookeeper.ZKAssign: regionserver:16020-0x1513899be42000d,
> >> quorum=amb1.service.consul:2181, baseZNode=/hbase-unsecure Attempt to
> >> transition the unassigned node for 1588230740 from M_ZK_REGION_OFFLINE
> >> to RS_ZK_REGION_FAILED_OPEN failed, the server that tried to
> >> transition was amb2.node.dc1.consul,16020,1448357534179 not the
> >> expected amb2.service.consul,16020,1448357534179
> >>
> >>
> >> [1] http://pastebin.com/z93p8Mdu
> >>
> >> On Tue, Nov 24, 2015 at 10:48 AM, Kristoffer Sjögren
> >> wrote:
> >> > I removed node.dc1.consul from resolv.conf and restarted the
> >> > cluster, but it still shows up on the master UI.
> >> >
> >> > amb2.node.dc1.consul,16020,1448353564099  Tue Nov 24 08:26:04 UTC 2015  0  0
> >> > amb2.service.consul,16020,1448353564099  Tue Nov 24 08:26:04 UTC 2015  0  0
> >> >
> >> > The logs report [1] that the meta region fails to assign to
> >> > node.dc1.consul; the master then tries to assign it to
> >> > amb2.service.consul and it gets stuck in PENDING_OPEN again.
> >> >
> >> > ---
> >> > 1588230740  hbase:meta,,1.1588230740 state=PENDING_OPEN, ts=Tue Nov 24
> >> > 09:32:26 UTC 2015 (450s ago),
> >> > server=amb2.service.consul,16020,1448357534179  450511
> >> > ---
> >> >
> >> > Before I restarted the cluster, the master log [2] complained about
> >> > not being able to connect to amb2.node.dc1.consul/172.17.0.85:16020.
> >> >
> >> > I'm not sure, but somehow it feels as if amb2.node.dc1.consul shadows
> >> > the real host amb2.service.consul.
> >> >
> >> > I was looking into the source code and found the configuration
> >> > 'hbase.regionserver.hostname' - could that be of help here to remove
> >> > the node.dc1 host?
> >> >
> >> > [1] http://pastebin.com/uZKqK9BJ
> >> > [2] http://pastebin.com/s10E2rtA
> >> >
> >> > On Tue, Nov 24, 2015 at 10:23 AM, Samir Ahmic wrote:
> >> >> Hi Kristoffer,
> >> >> It looks like you have some issue with name resolution. Try to remove
> >> >> the incorrect value (node.dc1.consul) from resolv.conf and then
> >> >> restart the hbase cluster.
> >> >> Regarding the issue with the region in transition, check the master
> >> >> log for "hbase:meta,,1.1588230740". There should be an exception
> >> >> explaining why hbase:meta cannot transition from PENDING_OPEN to OPEN
> >> >> state. If the hbase:meta table is unavailable, the master cannot
> >> >> finish initialization.
> >> >>
> >> >> Regards
> >> >> Samir
> >> >>
> >> >> On Tue, Nov 24, 2015 at 10:11 AM, Kristoffer Sjögren <stoffe@gmail.com>
> >> >> wrote:
> >> >>
> >> >>> Sorry, I should mention that this is HBase 1.1.2.
> >> >>>
> >> >>> ZooKeeper only reports one region server.
> >> >>>
> >> >>> $ ls /hbase-unsecure/rs
> >> >>> [amb2.service.consul,16020,1448353564099]
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Tue, Nov 24, 2015 at 9:55 AM, Kristoffer Sjögren <stoffe@gmail.com>
> >> >>> wrote:
> >> >>> > Hi
> >> >>> >
> >> >>> > I'm trying to install an HBase cluster with 1 master
> >> >>> > (amb1.service.consul) and 1 region server (amb2.service.consul)
> >> >>> > using Ambari on docker containers provided by sequenceiq [1] with
> >> >>> > a custom blueprint [2].
> >> >>> >
> >> >>> > Every component installs correctly except for HBase, which gets
> >> >>> > stuck with regions in transition:
> >> >>> >
> >> >>> > ---
> >> >>> > hbase:meta,,1.1588230740 state=PENDING_OPEN, ts=Tue Nov 24 08:26:45
> >> >>> > UTC 2015 (1098s ago), server=amb2.service.consul,16020,1448353564099
> >> >>> > ---
> >> >>> >
> >> >>> > And for some reason 2 region servers (instead of 1) are discovered
> >> >>> > by the master with the exact same timestamp but with different
> >> >>> > hostnames. I'm not sure if this is the reason why the regions get
> >> >>> > stuck.
> >> >>> >
> >> >>> > ----
> >> >>> > amb2.node.dc1.consul,16020,1448353564099  Tue Nov 24 08:26:04 UTC 2015  0  0
> >> >>> > amb2.service.consul,16020,1448353564099  Tue Nov 24 08:26:04 UTC 2015  0  0
> >> >>> > ----
> >> >>> >
> >> >>> > The only place I can find "amb2.node.dc1.consul" on the ambari
> >> >>> > agent/server hosts is in /etc/resolv.conf, which looks like this.
> >> >>> >
> >> >>> > ----
> >> >>> > nameserver 172.17.0.82
> >> >>> > search service.consul node.dc1.consul
> >> >>> > ----
> >> >>> >
> >> >>> > Is there some way that I can manually tell the master to disregard
> >> >>> > the "phantom" host amb2.node.dc1.consul?
> >> >>> >
> >> >>> > Any help or tips appreciated.
> >> >>> >
> >> >>> > Cheers,
> >> >>> > -Kristoffer
> >> >>> >
> >> >>> >
> >> >>> > [1] https://github.com/sequenceiq/docker-ambari
> >> >>> > [2] https://gist.githubusercontent.com/krisskross/901ed8223c1ed1db80e3/raw/869327be9ad15e6a9f099a7591323244cd245357/ambari-hdp2.3
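
The ping output quoted in this thread shows a forward lookup of amb2.service.consul returning 172.17.0.90, while the reply for that address is labelled amb2.node.dc1.consul. A minimal sketch for checking that round trip on each host, assuming Python is available in the containers. The function name check_round_trip and its injectable forward/reverse parameters are hypothetical helpers for illustration, not part of HBase or Ambari, and this is only a diagnostic aid, not a fix.

```python
import socket


def check_round_trip(hostname, forward=socket.gethostbyname,
                     reverse=socket.gethostbyaddr):
    """Resolve hostname -> IP -> name and report whether the two names agree.

    `forward` and `reverse` default to the system resolver. They are
    parameters (an assumption of this sketch) so the check can also be
    exercised with stand-in resolvers.
    """
    ip = forward(hostname)          # forward lookup: name -> IPv4 address
    reverse_name = reverse(ip)[0]   # gethostbyaddr returns (name, aliases, ips)
    return {
        "hostname": hostname,
        "ip": ip,
        "reverse_name": reverse_name,
        "consistent": reverse_name == hostname,
    }
```

Calling check_round_trip("amb2.service.consul") on both amb1 and amb2 and comparing the "reverse_name" fields should show whether the system resolver returns the same *.node.dc1.consul name that ping printed. A result with "consistent" set to False on either host points at the same mismatch the region server log complains about.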