hbase-user mailing list archives

From Asaf Mesika <asaf.mes...@gmail.com>
Subject Weird Replication exception
Date Sun, 02 Jun 2013 05:20:58 GMT
Hi,

I'm seeing a weird error in a cluster I'm testing replication on.

I have two clusters set up, each in its own DC (on different continents).
Each has one master and three region servers (RS).

I did all the required setup, started replication, and pushed some data
into the master cluster. Then I hit an issue where the slave (peer) cluster
went dead (all of its RS failed to contact the master), so replication
couldn't proceed. This happened right before the weekend, so it was down
for 3 days.

Now I'm back in the office. I got the slave cluster back up (just the RS),
and one of the RS in the master cluster is throwing a nasty exception:

2013-06-02 04:40:45,903 INFO
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Getting
0 rs from peer cluster # c
2013-06-02 04:40:45,903 INFO
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Slave
cluster looks down: c has 0 region servers
2013-06-02 04:40:46,903 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Since
we are unable to replicate, sleeping 1000 times 10
2013-06-02 04:40:57,019 INFO
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Getting
0 rs from peer cluster # c
2013-06-02 04:40:57,019 INFO
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Slave
cluster looks down: c has 0 region servers
2013-06-02 04:40:58,019 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Since
we are unable to replicate, sleeping 1000 times 10
2013-06-02 04:41:08,134 INFO
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Getting
1 rs from peer cluster # c
2013-06-02 04:41:08,134 INFO
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
Choosing peer a72-246-95-86,60020,1370147274693
2013-06-02 04:41:08,672 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
Replicating 1
2013-06-02 04:41:08,971 INFO
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
Going to report log #172.25.98.74%2C60020%2C1369903540894.1369921571842 for
position
 49882 in hdfs://
172.25.98.73:8020/hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369921571842
2013-06-02 04:41:08,973 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
Replicated in total: 1
2013-06-02 04:41:08,973 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening
log for replication 172.25.98.74%2C60020%2C1369903540894.1369921571842 at
49882
2013-06-02 04:41:08,977 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
currentNbOperations:0 and seenEntries:4 and size: 0
2013-06-02 04:41:08,977 INFO
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
Going to report log #172.25.98.74%2C60020%2C1369903540894.1369921571842 for
position
 69368 in hdfs://
172.25.98.73:8020/hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369921571842
2013-06-02 04:41:08,978 INFO
org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
Removing 0 logs in the list: []
2013-06-02 04:41:08,978 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Nothing
to replicate, sleeping 1000 times 1
2013-06-02 04:41:09,978 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening
log for replication 172.25.98.74%2C60020%2C1369903540894.1369921571842 at
69368
2013-06-02 04:41:09,980 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
currentNbOperations:0 and seenEntries:0 and size: 0
2013-06-02 04:41:09,980 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening
log for replication 172.25.98.74%2C60020%2C1369903540894.1369925171871 at 0
2013-06-02 04:41:09,981 WARN
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: c Got:
java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at java.io.DataInputStream.readFully(DataInputStream.java:152)
        at
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
        at
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
        at
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
        at
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
        at
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
        at
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
        at
org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:742)
        at
org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
        at
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:563)
        at
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:369)
2013-06-02 04:41:09,981 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable
to open a reader, sleeping 1000 times 1
2013-06-02 04:41:10,981 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening
log for replication 172.25.98.74%2C60020%2C1369903540894.1369925171871 at 0


Ok, so I took a look in HDFS for these logs, and here's what I had in store
for RS 74:

hadoop@172.25.98.73:~$ hadoop fs -ls /hbase/.logs
Found 3 items
drwxr-xr-x   - hadoop supergroup          0 2013-05-30 14:46 /hbase/.logs/
172.25.98.74,60020,1369903540894
drwxr-xr-x   - hadoop supergroup          0 2013-05-30 12:46 /hbase/.logs/
172.25.98.75,60020,1369903540900
drwxr-xr-x   - hadoop supergroup          0 2013-05-30 14:46 /hbase/.logs/
172.25.98.76,60020,1369903540898
hadoop@172.25.98.73:~$ hadoop fs -ls /hbase/.logs/172.25.98.74
,60020,1369903540894
Found 5 items
-rw-r--r--   3 hadoop supergroup       1380 2013-05-30 08:46 /hbase/.logs/
172.25.98.74,60020,1369903540894/172.25.98.74
%2C60020%2C1369903540894.1369903571677
-rw-r--r--   3 hadoop supergroup      59110 2013-05-30 09:46 /hbase/.logs/
172.25.98.74,60020,1369903540894/172.25.98.74
%2C60020%2C1369903540894.1369907171734
-rw-r--r--   3 hadoop supergroup      80508 2013-05-30 12:46 /hbase/.logs/
172.25.98.74,60020,1369903540894/172.25.98.74
%2C60020%2C1369903540894.1369917971811
-rw-r--r--   3 hadoop supergroup      69368 2013-05-30 13:46 /hbase/.logs/
172.25.98.74,60020,1369903540894/172.25.98.74
%2C60020%2C1369903540894.1369921571842
-rw-r--r--   3 hadoop supergroup          0 2013-05-30 14:46 /hbase/.logs/
172.25.98.74,60020,1369903540894/172.25.98.74
%2C60020%2C1369903540894.1369925171871
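
To spot zero-length WALs without eyeballing the listing, here is a small
helper I could have used. `find_empty_wals` is a made-up name, not anything
shipped with HBase or Hadoop; it just filters `hadoop fs -ls` output, where
column 5 is the file size in bytes.

```shell
# Hypothetical helper (not part of HBase): print zero-length files
# from `hadoop fs -ls` output. $1 is the permission string (files start
# with '-', directories with 'd'), $5 the size, $NF the path.
find_empty_wals() {
  awk '$1 ~ /^-/ && $5 == 0 { print $NF }'
}

# Intended usage:
#   hadoop fs -ls /hbase/.logs/172.25.98.74,60020,1369903540894 | find_empty_wals
# Demo on two captured lines from the listing above:
printf '%s\n' \
  '-rw-r--r--   3 hadoop supergroup      69368 2013-05-30 13:46 /hbase/.logs/x/a.1369921571842' \
  '-rw-r--r--   3 hadoop supergroup          0 2013-05-30 14:46 /hbase/.logs/x/a.1369925171871' |
  find_empty_wals
# -> /hbase/.logs/x/a.1369925171871
```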

So I saw that the log ending in 871 has 0 length, and I decided to delete
it from ZK and restart RS 74, so that its in-memory list would get
refreshed from ZK.
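
For anyone following along: the replication queue can be inspected with
`hbase zkcli` before deleting anything. The znode paths below are an
assumption based on the default layout (zookeeper.znode.parent of /hbase,
queues under replication/rs/<rs>/<peer-id>), with my peer id "c". The
`urldecode` helper is my own, since queue entries carry URL-encoded WAL
names (bash-specific).

```shell
# Inspect the queue znodes (assumed default layout; adjust to your config):
#   hbase zkcli ls /hbase/replication/rs/172.25.98.74,60020,1369903540894/c
#
# Queue entries use URL-encoded WAL names; a small bash decoder:
urldecode() {
  # Turn %2C etc. into \x2C and let printf %b expand it (bash-specific).
  printf '%b\n' "${1//%/\\x}"
}

urldecode '172.25.98.74%2C60020%2C1369903540894.1369921571842'
# -> 172.25.98.74,60020,1369903540894.1369921571842
```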

After that:

   - I lost the logs under /hbase/.logs/172.25.98.74,60020,1369903540894,
   meaning I lost data.

This is HDFS afterwards:

hadoop@172.25.98.73:~$ hadoop fs -ls /hbase/.logs/

Found 3 items
drwxr-xr-x   - hadoop supergroup          0 2013-06-02 05:02 /hbase/.logs/
172.25.98.74,60020,1370149320299
drwxr-xr-x   - hadoop supergroup          0 2013-05-30 12:46 /hbase/.logs/
172.25.98.75,60020,1369903540900
drwxr-xr-x   - hadoop supergroup          0 2013-05-30 14:46 /hbase/.logs/
172.25.98.76,60020,1369903540898
hadoop@172.25.98.73:~$ hadoop fs -ls /hbase/.logs/172.25.98.74
,60020,1370149320299
Found 1 items
-rw-r--r--   3 hadoop supergroup          0 2013-06-02 05:02 /hbase/.logs/
172.25.98.74,60020,1370149320299/172.25.98.74
%2C60020%2C1370149320299.1370149321515


   - RS 74 is now stuck on the same exception, but for the new WAL, which
   was just opened and is empty (since no data has been pushed yet):

2013-06-02 05:19:18,475 WARN
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited
too long for this file, considering dumping
2013-06-02 05:19:18,475 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable
to open a reader, sleeping 1000 times 10
2013-06-02 05:19:28,475 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Opening
log for replication 172.25.98.74%2C60020%2C1370149320299.1370149321515 at 0
2013-06-02 05:19:28,477 WARN
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: c Got:
java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at java.io.DataInputStream.readFully(DataInputStream.java:152)
        at
org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
        at
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
        at
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
        at
org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
        at
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
        at
org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
        at
org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:742)
        at
org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
        at
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:563)
        at
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:369)
2013-06-02 05:19:28,477 WARN
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Waited
too long for this file, considering dumping
2013-06-02 05:19:28,477 DEBUG
org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: Unable
to open a reader, sleeping 1000 times 10


So, after this long explanation - is this a bug?
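
For what it's worth, the EOFException itself seems unsurprising to me:
SequenceFile.Reader.init() starts by reading the fixed file header (the
'SEQ' magic plus a version byte), and on a zero-byte file that readFully
hits end-of-file immediately. A rough shell analogue of that header check,
on a zero-byte stand-in file:

```shell
# Rough analogue of SequenceFile.Reader.init(): read the 4-byte header
# ('S' 'E' 'Q' + version byte). On a zero-length file the read comes
# back empty, which is what surfaces as java.io.EOFException.
wal=$(mktemp)                      # zero-byte stand-in for the empty WAL
header=$(head -c 4 "$wal")
if [ "${header:0:3}" != "SEQ" ]; then
  echo "short/absent header: reader fails with EOFException"
fi
rm -f "$wal"
```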
