Subject: Re: Weird Replication exception
From: Asaf Mesika
Date: Sun, 2 Jun 2013 22:26:49 +0300
To: user@hbase.apache.org

I use 0.94.7. Is 0.94.8 production ready?

So in summary I have two issues:
1. Clocks are out of sync.
2. I need to upgrade to 0.94.8 to avoid seeing these WARN messages?
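On issue 2, for anyone else hitting this: the EOFException in the logs quoted below is raised while SequenceFile's reader eagerly parses the header of a zero-length WAL (a freshly rolled log on an idle region server), which sounds like exactly the case HBASE-7122 addresses. A minimal sketch of that failure mode, using Hadoop's plain SequenceFile API against a made-up local path (not HBase's own code path, just the same reader):

import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class EmptyWalEofRepro {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);

        // Zero-length file standing in for a freshly rolled, still-empty WAL
        // (the path is made up for illustration).
        File tmp = File.createTempFile("empty-wal", ".log");
        Path emptyWal = new Path(tmp.getAbsolutePath());
        fs.create(emptyWal, true).close();

        // SequenceFile.Reader reads the version header eagerly in init();
        // readFully() on a zero-length file throws java.io.EOFException,
        // which is the same exception ReplicationSource logs below.
        new SequenceFile.Reader(fs, emptyWal, conf);
    }
}

So, as far as I can tell, the WARN is noisy but harmless for a WAL that is simply still empty.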
On Jun 2, 2013, at 5:46 PM, Ted Yu wrote:

> What is the HBase version you're using?
>
> In another thread, I mentioned this:
>
> There was a recently integrated JIRA (0.94.8):
> HBASE-7122 Proper warning message when opening a log file with no entries
> (idle cluster)
>
> Does the HBase you're using contain HBASE-7122?
>
> Cheers
>
> On Sat, Jun 1, 2013 at 10:20 PM, Asaf Mesika wrote:
>
>> Hi,
>>
>> I have a weird error in a cluster I'm checking Replication with.
>>
>> I have two clusters set up, each on its own DC (different continents). Each
>> has 1 master, and 3 RS.
>>
>> I've done all required setup, started replication and pushed in some data
>> into the master. I had an issue where the slave (peer) cluster went dead
>> (all RS failed contacting the master), thus replication couldn't work. This
>> happened right before the weekend, so it was out for 3 days.
>>
>> Now I'm back in the office - got slave cluster back up (just the RS), and I
>> got some nasty exception in one of the RS of the master cluster:
>>
>> 2013-06-02 04:40:45,903 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Getting
>> 0 rs from peer cluster # c
>> 2013-06-02 04:40:45,903 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Slave
>> cluster looks down: c has 0 region servers
>> 2013-06-02 04:40:46,903 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Since
>> we are unable to replicate, sleeping 1000 times 10
>> 2013-06-02 04:40:57,019 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Getting
>> 0 rs from peer cluster # c
>> 2013-06-02 04:40:57,019 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Slave
>> cluster looks down: c has 0 region servers
>> 2013-06-02 04:40:58,019 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Since
>> we are unable to replicate, sleeping 1000 times 10
>> 2013-06-02 04:41:08,134 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Getting
>> 1 rs from peer cluster # c
>> 2013-06-02 04:41:08,134 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>> Choosing peer a72-246-95-86,60020,1370147274693
>> 2013-06-02 04:41:08,672 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>> Replicating 1
>> 2013-06-02 04:41:08,971 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
>> Going to report log #172.25.98.74%2C60020%2C1369903540894.1369921571842 for
>> position 49882 in
>> hdfs://172.25.98.73:8020/hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369921571842
>> 2013-06-02 04:41:08,973 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>> Replicated in total: 1
>> 2013-06-02 04:41:08,973 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening
>> log for replication 172.25.98.74%2C60020%2C1369903540894.1369921571842 at
>> 49882
>> 2013-06-02 04:41:08,977 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>> currentNbOperations:0 and seenEntries:4 and size: 0
>> 2013-06-02 04:41:08,977 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
>> Going to report log #172.25.98.74%2C60020%2C1369903540894.1369921571842 for
>> position 69368 in
>> hdfs://172.25.98.73:8020/hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369921571842
>> 2013-06-02 04:41:08,978 INFO
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
>> Removing 0 logs in the list: []
>> 2013-06-02 04:41:08,978 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Nothing
>> to replicate, sleeping 1000 times 1
>> 2013-06-02 04:41:09,978 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening
>> log for replication 172.25.98.74%2C60020%2C1369903540894.1369921571842 at
>> 69368
>> 2013-06-02 04:41:09,980 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
>> currentNbOperations:0 and seenEntries:0 and size: 0
>> 2013-06-02 04:41:09,980 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening
>> log for replication 172.25.98.74%2C60020%2C1369903540894.1369925171871 at 0
>> 2013-06-02 04:41:09,981 WARN
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: c Got:
>> java.io.EOFException
>>     at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>     at java.io.DataInputStream.readFully(DataInputStream.java:152)
>>     at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
>>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
>>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
>>     at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:742)
>>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
>>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:563)
>>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:369)
>> 2013-06-02 04:41:09,981 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable
>> to open a reader, sleeping 1000 times 1
>> 2013-06-02 04:41:10,981 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening
>> log for replication 172.25.98.74%2C60020%2C1369903540894.1369925171871 at 0
>>
>>
>> Ok, so I took a look in HDFS for these logs, and here's what I had in store
>> for RS 74:
>>
>> hadoop@172.25.98.73:~$ hadoop fs -ls /hbase/.logs
>> Found 3 items
>> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 14:46 /hbase/.logs/172.25.98.74,60020,1369903540894
>> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 12:46 /hbase/.logs/172.25.98.75,60020,1369903540900
>> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 14:46 /hbase/.logs/172.25.98.76,60020,1369903540898
>> hadoop@172.25.98.73:~$ hadoop fs -ls /hbase/.logs//hbase/.logs/172.25.98.74,60020,1369903540894
>> hadoop@172.25.98.73:~$ hadoop fs -ls /hbase/.logs/172.25.98.74,60020,1369903540894
>> Found 5 items
>> -rw-r--r--   3 hadoop supergroup       1380 2013-05-30 08:46 /hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369903571677
>> -rw-r--r--   3 hadoop supergroup      59110 2013-05-30 09:46 /hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369907171734
>> -rw-r--r--   3 hadoop supergroup      80508 2013-05-30 12:46 /hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369917971811
>> -rw-r--r--   3 hadoop supergroup      69368 2013-05-30 13:46 /hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369921571842
>> -rw-r--r--   3 hadoop supergroup          0 2013-05-30 14:46 /hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369925171871
>>
>> So I saw that the log ending with 871 is of 0 length, so I've decided to delete
>> it from ZK and restart RS 74, so its in-memory list will get refreshed from ZK.
>>
>> After that:
>>
>> - I lost the logs for /hbase/.logs/172.25.98.74,60020,1369903540894 -
>>   meaning I lost data.
>>
>> This is HDFS afterwards:
>>
>> hadoop@172.25.98.73:~$ hadoop fs -ls /hbase/.logs/
>> Found 3 items
>> drwxr-xr-x   - hadoop supergroup          0 2013-06-02 05:02 /hbase/.logs/172.25.98.74,60020,1370149320299
>> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 12:46 /hbase/.logs/172.25.98.75,60020,1369903540900
>> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 14:46 /hbase/.logs/172.25.98.76,60020,1369903540898
>> hadoop@172.25.98.73:~$ hadoop fs -ls /hbase/.logs/172.25.98.74,60020,1370149320299
>> Found 1 items
>> -rw-r--r--   3 hadoop supergroup          0 2013-06-02 05:02 /hbase/.logs/172.25.98.74,60020,1370149320299/172.25.98.74%2C60020%2C1370149320299.1370149321515
>>
>> - RS 74 is now stuck at the same exception, but on the new WAL which just
>>   opened but is empty (since no data has been pushed yet):
>>
>> 2013-06-02 05:19:18,475 WARN
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited
>> too long for this file, considering dumping
>> 2013-06-02 05:19:18,475 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable
>> to open a reader, sleeping 1000 times 10
>> 2013-06-02 05:19:28,475 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening
>> log for replication 172.25.98.74%2C60020%2C1370149320299.1370149321515 at 0
>> 2013-06-02 05:19:28,477 WARN
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: c Got:
>> java.io.EOFException
>>     at java.io.DataInputStream.readFully(DataInputStream.java:180)
>>     at java.io.DataInputStream.readFully(DataInputStream.java:152)
>>     at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
>>     at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
>>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
>>     at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
>>     at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:742)
>>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
>>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:563)
>>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:369)
>> 2013-06-02 05:19:28,477 WARN
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited
>> too long for this file, considering dumping
>> 2013-06-02 05:19:28,477 DEBUG
>> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable
>> to open a reader, sleeping 1000 times 10
>>
>>
>> So, after this long explanation - is this a bug?
>>
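By the way, regarding the "delete it from ZK and restart RS 74" step quoted above: the queue of WALs each region server still has to ship lives under the replication znodes, so it's worth listing them before removing anything. A minimal sketch of how that can be inspected with the plain ZooKeeper client - the quorum address, the default /hbase znode parent and the /hbase/replication/rs/<server>/<peerId>/<wal> layout are assumptions for a stock 0.94 setup, and the peer id "c" is taken from the logs:

import java.util.List;

import org.apache.zookeeper.ZooKeeper;

public class ListReplicationQueues {
    public static void main(String[] args) throws Exception {
        // Quorum address and znode layout are assumptions for a default setup:
        // zookeeper.znode.parent = /hbase, replication peer id "c" as in the logs.
        ZooKeeper zk = new ZooKeeper("172.25.98.73:2181", 30000, event -> { });

        String queues = "/hbase/replication/rs";
        for (String rs : zk.getChildren(queues, false)) {
            for (String peer : zk.getChildren(queues + "/" + rs, false)) {
                // Each child below should be a WAL name still queued for this peer.
                List<String> wals = zk.getChildren(queues + "/" + rs + "/" + peer, false);
                System.out.println(rs + " -> peer " + peer + ": " + wals);
            }
        }
        zk.close();
    }
}

As far as I understand, each WAL znode's data holds the last shipped offset, which would match the "Going to report log ... for position" lines above - another reason to look before deleting.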