Subject: Re: Weird Replication exception
From: Asaf Mesika
Date: Sun, 2 Jun 2013 22:26:49 +0300
To: user@hbase.apache.org

I use 0.94.7. Is 0.94.8 production ready?

So in summary I have two issues:
1. Clocks are out of sync.
2. I need to upgrade to 0.94.8 to avoid seeing these WARN messages?
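On issue 2, for anyone else hitting this: the EOFException in the logs quoted below is raised while SequenceFile's reader eagerly parses the header of a zero-length WAL (a freshly rolled log on an idle region server), which sounds like exactly the case HBASE-7122 addresses. A minimal sketch of that failure mode, using Hadoop's plain SequenceFile API against a made-up local path (not HBase's own code path, just the same reader):

import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class EmptyWalEofRepro {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);

        // Zero-length file standing in for a freshly rolled, still-empty WAL
        // (the path is made up for illustration).
        File tmp = File.createTempFile("empty-wal", ".log");
        Path emptyWal = new Path(tmp.getAbsolutePath());
        fs.create(emptyWal, true).close();

        // SequenceFile.Reader reads the version header eagerly in init();
        // readFully() on a zero-length file throws java.io.EOFException,
        // which is the same exception ReplicationSource logs below.
        new SequenceFile.Reader(fs, emptyWal, conf);
    }
}

So, as far as I can tell, the WARN is noisy but harmless for a WAL that is simply still empty.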
On Jun 2, 2013, at 5:46 PM, Ted Yu wrote:

> What is the HBase version you're using?
>
> In another thread, I mentioned this:
>
> There was a recently integrated JIRA (0.94.8):
> HBASE-7122 Proper warning message when opening a log file with no entries
> (idle cluster)
>
> Does the HBase you're using contain HBASE-7122?
>
> Cheers
>
> On Sat, Jun 1, 2013 at 10:20 PM, Asaf Mesika wrote:
>
>> Hi,
>>
>> I have a weird error in a cluster I'm checking Replication with.
>>
>> I have two clusters set up, each on its own DC (different continents). Each
>> has 1 master, and 3 RS.
>>
>> I've done all required setup, started replication and pushed in some data
>> into the master. I had an issue where the slave (peer) cluster went dead
>> (all RS failed contacting the master), thus replication couldn't work. This
>> happened right before the weekend, so it was out for 3 days.
>>
>> Now I'm back in the office - got slave cluster back up (just the RS), and I
>> got some nasty exception in one of the RS of the master cluster:
>>
>> 2013-06-02 04:40:45,903 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Getting
>> 0 rs from peer cluster # c
>> 2013-06-02 04:40:45,903 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Slave
>> cluster looks down: c has 0 region servers
>> 2013-06-02 04:40:46,903 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Since
>> we are unable to replicate, sleeping 1000 times 10
>> 2013-06-02 04:40:57,019 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Getting
>> 0 rs from peer cluster # c
>> 2013-06-02 04:40:57,019 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Slave
>> cluster looks down: c has 0 region servers
>> 2013-06-02 04:40:58,019 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Since
>> we are unable to replicate, sleeping 1000 times 10
>> 2013-06-02 04:41:08,134 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Getting
>> 1 rs from peer cluster # c
>> 2013-06-02 04:41:08,134 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>> Choosing peer a72-246-95-86,60020,1370147274693
>> 2013-06-02 04:41:08,672 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>> Replicating 1
>> 2013-06-02 04:41:08,971 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
>> Going to report log #172.25.98.74%2C60020%2C1369903540894.1369921571842 for
>> position 49882 in
>> hdfs://172.25.98.73:8020/hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369921571842
>> 2013-06-02 04:41:08,973 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>> Replicated in total: 1
>> 2013-06-02 04:41:08,973 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening
>> log for replication 172.25.98.74%2C60020%2C1369903540894.1369921571842 at
>> 49882
>> 2013-06-02 04:41:08,977 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>> currentNbOperations:0 and seenEntries:4 and size: 0
>> 2013-06-02 04:41:08,977 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
>> Going to report log #172.25.98.74%2C60020%2C1369903540894.1369921571842 for
>> position 69368 in
>> hdfs://172.25.98.73:8020/hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369921571842
>> 2013-06-02 04:41:08,978 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
>> Removing 0 logs in the list: []
>> 2013-06-02 04:41:08,978 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Nothing
>> to replicate, sleeping 1000 times 1
>> 2013-06-02 04:41:09,978 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening
>> log for replication 172.25.98.74%2C60020%2C1369903540894.1369921571842 at
>> 69368
>> 2013-06-02 04:41:09,980 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>> currentNbOperations:0 and seenEntries:0 and size: 0
>> 2013-06-02 04:41:09,980 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening
>> log for replication 172.25.98.74%2C60020%2C1369903540894.1369925171871 at 0
>> 2013-06-02 04:41:09,981 WARN
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: c Got:
>> java.io.EOFException
>>     at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>     at java.io.DataInputStream.readFully(DataInputStream.java:152)
>>     at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
>>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
>>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
>>     at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:742)
>>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
>>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:563)
>>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:369)
>> 2013-06-02 04:41:09,981 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable
>> to open a reader, sleeping 1000 times 1
>> 2013-06-02 04:41:10,981 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening
>> log for replication 172.25.98.74%2C60020%2C1369903540894.1369925171871 at 0
>>
>>
>> Ok, so I took a look in HDFS for these logs, and here's what I had in store
>> for RS 74:
>>
>> hadoop@172.25.98.73:~$ hadoop fs -ls /hbase/.logs
>> Found 3 items
>> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 14:46 /hbase/.logs/172.25.98.74,60020,1369903540894
>> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 12:46 /hbase/.logs/172.25.98.75,60020,1369903540900
>> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 14:46 /hbase/.logs/172.25.98.76,60020,1369903540898
>> hadoop@172.25.98.73:~$ hadoop fs -ls /hbase/.logs//hbase/.logs/172.25.98.74,60020,1369903540894
>> hadoop@172.25.98.73:~$ hadoop fs -ls /hbase/.logs/172.25.98.74,60020,1369903540894
>> Found 5 items
>> -rw-r--r--   3 hadoop supergroup       1380 2013-05-30 08:46 /hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369903571677
>> -rw-r--r--   3 hadoop supergroup      59110 2013-05-30 09:46 /hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369907171734
>> -rw-r--r--   3 hadoop supergroup      80508 2013-05-30 12:46 /hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369917971811
>> -rw-r--r--   3 hadoop supergroup      69368 2013-05-30 13:46 /hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369921571842
>> -rw-r--r--   3 hadoop supergroup          0 2013-05-30 14:46 /hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369925171871
>>
>> So I saw that the log ending with 871 is of 0 length, so I've decided to delete
>> it from ZK and restart RS 74, so its in-memory list will get refreshed from ZK.
>>
>> After that:
>>
>> - I lost the logs for /hbase/.logs/172.25.98.74,60020,1369903540894 -
>>   meaning I lost data.
>>
>> This is HDFS afterwards:
>>
>> hadoop@172.25.98.73:~$ hadoop fs -ls /hbase/.logs/
>> Found 3 items
>> drwxr-xr-x   - hadoop supergroup          0 2013-06-02 05:02 /hbase/.logs/172.25.98.74,60020,1370149320299
>> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 12:46 /hbase/.logs/172.25.98.75,60020,1369903540900
>> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 14:46 /hbase/.logs/172.25.98.76,60020,1369903540898
>> hadoop@172.25.98.73:~$ hadoop fs -ls /hbase/.logs/172.25.98.74,60020,1370149320299
>> Found 1 items
>> -rw-r--r--   3 hadoop supergroup          0 2013-06-02 05:02 /hbase/.logs/172.25.98.74,60020,1370149320299/172.25.98.74%2C60020%2C1370149320299.1370149321515
>>
>> - RS 74 is now stuck at the same exception, but on the new WAL which just
>>   opened but is empty (since no data has been pushed yet):
>>
>> 2013-06-02 05:19:18,475 WARN
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited
>> too long for this file, considering dumping
>> 2013-06-02 05:19:18,475 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable
>> to open a reader, sleeping 1000 times 10
>> 2013-06-02 05:19:28,475 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening
>> log for replication 172.25.98.74%2C60020%2C1370149320299.1370149321515 at 0
>> 2013-06-02 05:19:28,477 WARN
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: c Got:
>> java.io.EOFException
>>     at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>     at java.io.DataInputStream.readFully(DataInputStream.java:152)
>>     at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
>>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
>>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
>>     at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:742)
>>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
>>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:563)
>>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:369)
>> 2013-06-02 05:19:28,477 WARN
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited
>> too long for this file, considering dumping
>> 2013-06-02 05:19:28,477 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable
>> to open a reader, sleeping 1000 times 10
>>
>>
>> So, after this long explanation - is this a bug?
>>
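By the way, regarding the "delete it from ZK and restart RS 74" step quoted above: the queue of WALs each region server still has to ship lives under the replication znodes, so it's worth listing them before removing anything. A minimal sketch of how that can be inspected with the plain ZooKeeper client - the quorum address, the default /hbase znode parent and the /hbase/replication/rs/<server>/<peerId>/<wal> layout are assumptions for a stock 0.94 setup, and the peer id "c" is taken from the logs:

import java.util.List;

import org.apache.zookeeper.ZooKeeper;

public class ListReplicationQueues {
    public static void main(String[] args) throws Exception {
        // Quorum address and znode layout are assumptions for a default setup:
        // zookeeper.znode.parent = /hbase, replication peer id "c" as in the logs.
        ZooKeeper zk = new ZooKeeper("172.25.98.73:2181", 30000, event -> { });

        String queues = "/hbase/replication/rs";
        for (String rs : zk.getChildren(queues, false)) {
            for (String peer : zk.getChildren(queues + "/" + rs, false)) {
                // Each child below should be a WAL name still queued for this peer.
                List<String> wals = zk.getChildren(queues + "/" + rs + "/" + peer, false);
                System.out.println(rs + " -> peer " + peer + ": " + wals);
            }
        }
        zk.close();
    }
}

As far as I understand, each WAL znode's data holds the last shipped offset, which would match the "Going to report log ... for position" lines above - another reason to look before deleting.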