hbase-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ted Yu <yuzhih...@gmail.com>
Subject Re: Weird Replication exception
Date Sun, 02 Jun 2013 19:38:49 GMT
bq. Is 0.94.8 production ready?

I think so. Lars released 0.94.8 Friday evening.

On Sun, Jun 2, 2013 at 12:26 PM, Asaf Mesika <asaf.mesika@gmail.com> wrote:

> I use 0.94.7.
> Is 0.94.8 production ready?
>
> So in summary I have two issues:
> 1. Clocks are out of sync
> 2. I need to upgrade to 0.94.8 to avoid seeing this WARN messages?
>
> On Jun 2, 2013, at 5:46 PM, Ted Yu <yuzhihong@gmail.com> wrote:
>
> > What is the HBase version you're using ?
> >
> > In another thread, I mentioned this:
> >
> > There was a recently integrated JIRA (0.94.8):
> > HBASE-7122 Proper warning message when opening a log file with no entries
> > (idle cluster)
> >
> > Does the HBase you're using contain HBASE-7122 ?
> >
> > Cheers
> >
> > On Sat, Jun 1, 2013 at 10:20 PM, Asaf Mesika <asaf.mesika@gmail.com>
> wrote:
> >
> >> Hi,
> >>
> >> I have a weird error in a cluster I'm checking Replication with.
> >>
> >> I have two clusters set up, each on its own DC (different continents).
> Each
> >> has 1 master, and 3 RS.
> >>
> >> I've done all required setup, started replication and pushed in some
> data
> >> into the master. I had an issue where the slave (peer) cluster went dead
> >> (all RS failed contacting the master), thus replication couldn't work.
> This
> >> happened right before the weekend, so it was out for 3 days.
> >>
> >> Now I'm back in the office - got slave cluster back up (just the RS),
> and I
> >> got some nasty exception in one of the RS of the master cluster:
> >>
> >> 2013-06-02 04:40:45,903 INFO
> >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Getting
> >> 0 rs from peer cluster # c
> >> 2013-06-02 04:40:45,903 INFO
> >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Slave
> >> cluster looks down: c has 0 region servers
> >> 2013-06-02 04:40:46,903 DEBUG
> >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Since
> >> we are unable to replicate, sleeping 1000 times 10
> >> 2013-06-02 04:40:57,019 INFO
> >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Getting
> >> 0 rs from peer cluster # c
> >> 2013-06-02 04:40:57,019 INFO
> >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Slave
> >> cluster looks down: c has 0 region servers
> >> 2013-06-02 04:40:58,019 DEBUG
> >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Since
> >> we are unable to replicate, sleeping 1000 times 10
> >> 2013-06-02 04:41:08,134 INFO
> >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Getting
> >> 1 rs from peer cluster # c
> >> 2013-06-02 04:41:08,134 INFO
> >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> >> Choosing peer a72-246-95-86,60020,1370147274693
> >> 2013-06-02 04:41:08,672 DEBUG
> >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> >> Replicating 1
> >> 2013-06-02 04:41:08,971 INFO
> >>
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
> >> Going to report log #172.25.98.74%2C60020%2C1369903540894.1369921571842
> for
> >> position
> >> 49882 in hdfs://
> >>
> >>
> 172.25.98.73:8020/hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369921571842
> >> 2013-06-02 04:41:08,973 DEBUG
> >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> >> Replicated in total: 1
> >> 2013-06-02 04:41:08,973 DEBUG
> >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Opening
> >> log for replication 172.25.98.74%2C60020%2C1369903540894.1369921571842
> at
> >> 49882
> >> 2013-06-02 04:41:08,977 DEBUG
> >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> >> currentNbOperations:0 and seenEntries:4 and size: 0
> >> 2013-06-02 04:41:08,977 INFO
> >>
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
> >> Going to report log #172.25.98.74%2C60020%2C1369903540894.1369921571842
> for
> >> position
> >> 69368 in hdfs://
> >>
> >>
> 172.25.98.73:8020/hbase/.logs/172.25.98.74,60020,1369903540894/172.25.98.74%2C60020%2C1369903540894.1369921571842
> >> 2013-06-02 04:41:08,978 INFO
> >>
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager:
> >> Removing 0 logs in the list: []
> >> 2013-06-02 04:41:08,978 DEBUG
> >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Nothing
> >> to replicate, sleeping 1000 times 1
> >> 2013-06-02 04:41:09,978 DEBUG
> >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Opening
> >> log for replication 172.25.98.74%2C60020%2C1369903540894.1369921571842
> at
> >> 69368
> >> 2013-06-02 04:41:09,980 DEBUG
> >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> >> currentNbOperations:0 and seenEntries:0 and size: 0
> >> 2013-06-02 04:41:09,980 DEBUG
> >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Opening
> >> log for replication 172.25.98.74%2C60020%2C1369903540894.1369925171871
> at 0
> >> 2013-06-02 04:41:09,981 WARN
> >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: c
> Got:
> >> java.io.EOFException
> >>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
> >>        at java.io.DataInputStream.readFully(DataInputStream.java:152)
> >>        at
> >> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
> >>        at
> >> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
> >>        at
> >> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
> >>        at
> >> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
> >>        at
> >>
> >>
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
> >>        at
> >>
> >>
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
> >>        at
> >> org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:742)
> >>        at
> >>
> >>
> org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
> >>        at
> >>
> >>
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:563)
> >>        at
> >>
> >>
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:369)
> >> 2013-06-02 04:41:09,981 DEBUG
> >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Unable
> >> to open a reader, sleeping 1000 times 1
> >> 2013-06-02 04:41:10,981 DEBUG
> >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Opening
> >> log for replication 172.25.98.74%2C60020%2C1369903540894.1369925171871
> at 0
> >>
> >>
> >> Ok, so I took a look in HDFS for this logs, and here's what I had in
> store
> >> for RS 74:
> >>
> >> adoop@172.25.98.73:~$ hadoop fs -ls /hbase/.logs
> >> Found 3 items
> >> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 14:46
> /hbase/.logs/
> >> 172.25.98.74,60020,1369903540894
> >> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 12:46
> /hbase/.logs/
> >> 172.25.98.75,60020,1369903540900
> >> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 14:46
> /hbase/.logs/
> >> 172.25.98.76,60020,1369903540898
> >> hadoop@172.25.98.73:~$ hadoop fs -ls /hbase/.logs//hbase/.logs/
> >> 172.25.98.74
> >> ,60020,1369903540894
> >> hadoop@172.25.98.73:~$ hadoop fs -ls /hbase/.logs/172.25.98.74
> >> ,60020,1369903540894
> >> Found 5 items
> >> -rw-r--r--   3 hadoop supergroup       1380 2013-05-30 08:46
> /hbase/.logs/
> >> 172.25.98.74,60020,1369903540894/172.25.98.74
> >> %2C60020%2C1369903540894.1369903571677
> >> -rw-r--r--   3 hadoop supergroup      59110 2013-05-30 09:46
> /hbase/.logs/
> >> 172.25.98.74,60020,1369903540894/172.25.98.74
> >> %2C60020%2C1369903540894.1369907171734
> >> -rw-r--r--   3 hadoop supergroup      80508 2013-05-30 12:46
> /hbase/.logs/
> >> 172.25.98.74,60020,1369903540894/172.25.98.74
> >> %2C60020%2C1369903540894.1369917971811
> >> -rw-r--r--   3 hadoop supergroup      69368 2013-05-30 13:46
> /hbase/.logs/
> >> 172.25.98.74,60020,1369903540894/172.25.98.74
> >> %2C60020%2C1369903540894.1369921571842
> >> -rw-r--r--   3 hadoop supergroup          0 2013-05-30 14:46
> /hbase/.logs/
> >> 172.25.98.74,60020,1369903540894/172.25.98.74
> >> %2C60020%2C1369903540894.1369925171871
> >>
> >> So I saw the log ending with 871 is of 0 length - so I've decided to
> delete
> >> it from ZK and restart RS 74, so its in-memory list will get refreshed
> from
> >> ZK.
> >>
> >> After that:
> >>
> >>   - I lost the logs for /hbase/.logs/172.25.98.74,60020,1369903540894 -
> >>   Meaning I lost data
> >>
> >> This is HDFS afterwards:
> >>
> >> hadoop@172.25.98.73:~$ hadoop fs -ls /hbase/.logs/
> >>
> >> Found 3 items
> >> drwxr-xr-x   - hadoop supergroup          0 2013-06-02 05:02
> /hbase/.logs/
> >> 172.25.98.74,60020,1370149320299
> >> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 12:46
> /hbase/.logs/
> >> 172.25.98.75,60020,1369903540900
> >> drwxr-xr-x   - hadoop supergroup          0 2013-05-30 14:46
> /hbase/.logs/
> >> 172.25.98.76,60020,1369903540898
> >> hadoop@172.25.98.73:~$ hadoop fs -ls /hbase/.logs/172.25.98.74
> >> ,60020,1370149320299
> >> Found 1 items
> >> -rw-r--r--   3 hadoop supergroup          0 2013-06-02 05:02
> /hbase/.logs/
> >> 172.25.98.74,60020,1370149320299/172.25.98.74
> >> %2C60020%2C1370149320299.1370149321515
> >>
> >>
> >>   - RS 74 is now stuck at same exception, but on the new WAL which just
> >>   opened but it empty (since no data has been pushed yet):
> >>
> >> 2013-06-02 05:19:18,475 WARN
> >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Waited
> >> too long for this file, considering dumping
> >> 2013-06-02 05:19:18,475 DEBUG
> >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Unable
> >> to open a reader, sleeping 1000 times 10
> >> 2013-06-02 05:19:28,475 DEBUG
> >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Opening
> >> log for replication 172.25.98.74%2C60020%2C1370149320299.1370149321515
> at 0
> >> 2013-06-02 05:19:28,477 WARN
> >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource: c
> Got:
> >> java.io.EOFException
> >>        at java.io.DataInputStream.readFully(DataInputStream.java:180)
> >>        at java.io.DataInputStream.readFully(DataInputStream.java:152)
> >>        at
> >> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1508)
> >>        at
> >> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
> >>        at
> >> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1475)
> >>        at
> >> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1470)
> >>        at
> >>
> >>
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.<init>(SequenceFileLogReader.java:55)
> >>        at
> >>
> >>
> org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:177)
> >>        at
> >> org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:742)
> >>        at
> >>
> >>
> org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:67)
> >>        at
> >>
> >>
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:563)
> >>        at
> >>
> >>
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:369)
> >> 2013-06-02 05:19:28,477 WARN
> >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Waited
> >> too long for this file, considering dumping
> >> 2013-06-02 05:19:28,477 DEBUG
> >> org.apache.hadoop.hbase.replication.regionserver.ReplicationSource:
> Unable
> >> to open a reader, sleeping 1000 times 10
> >>
> >>
> >> So, after this long explanation - is this a bug?
> >>
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message