hbase-issues mailing list archives

From "Cosmin Lehene (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-3660) If regions assignment fails, clients will be directed to stale data from .META.
Date Fri, 18 Mar 2011 10:55:29 GMT

    [ https://issues.apache.org/jira/browse/HBASE-3660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008394#comment-13008394 ]

Cosmin Lehene commented on HBASE-3660:
--------------------------------------

LZO not working would indeed be a bigger problem. However, I only mentioned LZO because it made
the issue easier to spot; it is not necessary to cause the problem.

The question is: is it OK, when a region is unavailable, to have clients contacting other region
servers? I was thinking this could lead to other problems. The solution I had in mind was not to
remove the old server address from .META., but to mark that the region is not actually
deployed.
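
Roughly what I had in mind, as a sketch only (written against the 0.90 client API; the info:deployed
qualifier is made up here for illustration and does not exist in the current .META. schema): rather
than the master deleting info:server for a region it failed to assign, it would flip a marker that
clients check before trusting the cached address.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class RegionDeployedMarker {
  private static final byte[] INFO = Bytes.toBytes("info");
  // Hypothetical qualifier, used only to illustrate the idea.
  private static final byte[] DEPLOYED = Bytes.toBytes("deployed");

  // Instead of removing info:server from .META., mark the region as not deployed.
  public static void markUndeployed(Configuration conf, byte[] metaRow) throws Exception {
    HTable meta = new HTable(conf, ".META.");
    try {
      Put p = new Put(metaRow);
      p.add(INFO, DEPLOYED, Bytes.toBytes(false));
      meta.put(p);
    } finally {
      meta.close();
    }
  }

  // Clients would consult the marker before trusting the cached server address.
  public static boolean isDeployed(Configuration conf, byte[] metaRow) throws Exception {
    HTable meta = new HTable(conf, ".META.");
    try {
      Result r = meta.get(new Get(metaRow).addColumn(INFO, DEPLOYED));
      byte[] v = r.getValue(INFO, DEPLOYED);
      // A missing marker is treated as deployed, to stay compatible with existing rows.
      return v == null || Bytes.toBoolean(v);
    } finally {
      meta.close();
    }
  }
}

The real change would of course live in the master/assignment code path, not in client code like
this; the point is only that keeping the row but marking it gives clients a way to tell a stale
address from a live one.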

I'm seeing this on my laptop when I switch networks. I re-tested the network switch scenario:
Shut down everything in network A (192.168.2.0).
Start everything (including ZK and HDFS) in network B (10.131.171.0).

When starting HBase I get this:

in HMaster:

2011-03-18 11:40:38,953 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: hlog file
splitting completed in 7944 ms for hdfs://localhost:9000/hbase/.logs/192.168.2.102,60020,1300389033686
2011-03-18 11:40:58,998 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting to server:
192.168.2.102/192.168.2.102:60020
2011-03-18 11:41:20,000 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting to server:
192.168.2.102/192.168.2.102:60020
2011-03-18 11:41:25,163 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception.
Starting shutdown.
java.net.SocketException: Network is unreachable
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)

Then it shuts down.

In HRegionServer

2011-03-18 11:39:24,138 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Attempting
connect to Master server at 192.168.2.102:60000
2011-03-18 11:39:44,172 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting to server:
192.168.2.102/192.168.2.102:60000
2011-03-18 11:40:05,172 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting to server:
192.168.2.102/192.168.2.102:60000
2011-03-18 11:40:26,174 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting to server:
192.168.2.102/192.168.2.102:60000
2011-03-18 11:40:26,175 WARN org.apache.hadoop.hbase.regionserver.HRegionServer: Unable to
connect to master. Retrying. Error was:
java.net.SocketTimeoutException: 20000 millis timeout while waiting for channel to be ready
for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=192.168.2.102/192.168.2.102:60000]
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:213)
...

2011-03-18 11:40:29,180 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Attempting
connect to Master server at 10.131.171.219:60000
2011-03-18 11:40:29,297 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Connected
to master at 10.131.171.219:60000
2011-03-18 11:40:29,300 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master
at 10.131.171.219:60000 that we are up
2011-03-18 11:40:29,329 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Master passed
us address to use. Was=10.131.171.219:60020, Now=10.131.171.219:60020
2011-03-18 11:40:29,331 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: Config from
master: fs.default.name=hdfs://localhost:9000/hbase

...


2011-03-18 11:40:30,784 INFO org.apache.hadoop.ipc.HBaseServer: PRI IPC Server handler 9 on
60020: starting
2011-03-18 11:40:30,784 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Serving as
10.131.171.219,60020,1300441163636, RPC listening on /10.131.171.219:60020, sessionid=0x12ec85503600002
2011-03-18 11:40:30,795 INFO org.apache.hadoop.hbase.regionserver.StoreFile: Allocating LruBlockCache
with maximum size 199.2m
2011-03-18 11:41:27,876 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: No master
found, will retry


Since HMaster is dead I start it again:

2011-03-18 12:04:32,863 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting on regionserver(s)
count to settle; currently=1
2011-03-18 12:04:34,364 INFO org.apache.hadoop.hbase.master.ServerManager: Finished waiting
for regionserver count to settle; count=1, sleptFor=4500
2011-03-18 12:04:34,364 INFO org.apache.hadoop.hbase.master.ServerManager: Exiting wait on
regionserver(s) to checkin; count=1, stopped=false, count of regions out on cluster=0
2011-03-18 12:04:34,368 INFO org.apache.hadoop.hbase.master.MasterFileSystem: Log folder hdfs://localhost:9000/hbase/.logs/10.131.171.219,60020,1300441163636
belongs to an existing region server
2011-03-18 12:04:54,057 DEBUG org.apache.hadoop.hbase.client.MetaScanner: Scanning .META.
starting at row= for max=2147483647 rows
2011-03-18 12:04:54,063 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@63e708b2;
hsa=192.168.2.102:60020
2011-03-18 12:04:54,390 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting to server:
192.168.2.102/192.168.2.102:60020
2011-03-18 12:05:15,391 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting to server:
192.168.2.102/192.168.2.102:60020
2011-03-18 12:05:36,392 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting to server:
192.168.2.102/192.168.2.102:60020
2011-03-18 12:05:36,393 DEBUG org.apache.hadoop.hbase.catalog.CatalogTracker: Timed out connecting
to 192.168.2.102:60020
2011-03-18 12:05:36,394 INFO org.apache.hadoop.hbase.catalog.RootLocationEditor: Unsetting
ROOT region location in ZooKeeper
2011-03-18 12:05:36,409 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x12ec85503600004
Creating (or updating) unassigned node for 70236052 with OFFLINE state
2011-03-18 12:05:36,424 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous
transition plan was found (or we are ignoring an existing plan) for -ROOT-,,0.70236052 so
generated a random one; hri=-ROOT-,,0.70236052, src=, dest=10.131.171.219,60020,1300441163636;
1 (online=1, exclude=null) available servers
2011-03-18 12:05:36,425 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning
region -ROOT-,,0.70236052 to 10.131.171.219,60020,1300441163636
2011-03-18 12:05:36,425 DEBUG org.apache.hadoop.hbase.master.ServerManager: New connection
to 10.131.171.219,60020,1300441163636
2011-03-18 12:05:56,395 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting to server:
192.168.2.102/192.168.2.102:60020
2011-03-18 12:06:08,899 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions
in transition timed out:  -ROOT-,,0.70236052 state=PENDING_OPEN, ts=1300442736425
2011-03-18 12:06:08,901 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has
been PENDING_OPEN for too long, reassigning region=-ROOT-,,0.70236052
2011-03-18 12:06:08,901 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE;
was=-ROOT-,,0.70236052 state=PENDING_OPEN, ts=1300442736425
2011-03-18 12:06:17,397 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting to server:
192.168.2.102/192.168.2.102:60020
2011-03-18 12:06:38,399 INFO org.apache.hadoop.ipc.HbaseRPC: Problem connecting to server:
192.168.2.102/192.168.2.102:60020

...


2011-03-18 12:06:57,814 DEBUG org.apache.hadoop.hbase.client.MetaScanner: Scanning .META.
starting at row= for max=2147483647 rows
2011-03-18 12:06:57,817 DEBUG org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
Lookedup root region location, connection=org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation@63e708b2;
hsa=10.131.171.219:60020
2011-03-18 12:06:58,051 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception.
Starting shutdown.
java.net.SocketException: Network is unreachable
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)

HMaster kills itself again. Stopping the regionserver and starting it again with HMaster will
yield the same results, and so on. At some point, after a few restarts, it will start and work
(at least until you change IPs again).

It's not clear (to me) if the stale data is in .META. or if it could be in ZK as well.
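
For what it's worth, this is how I've been looking at where the stale address lives. A quick sketch
with the regular client API that dumps the info:server column for every row in .META. (nothing here
beyond the standard Scan API, and it obviously only works once .META. itself is reachable):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class DumpMetaServers {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable meta = new HTable(conf, ".META.");
    try {
      Scan scan = new Scan();
      scan.addColumn(Bytes.toBytes("info"), Bytes.toBytes("server"));
      ResultScanner scanner = meta.getScanner(scan);
      try {
        for (Result r : scanner) {
          byte[] server = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("server"));
          // Any 192.168.2.x address printed here is a leftover from network A.
          System.out.println(Bytes.toString(r.getRow()) + " -> "
              + (server == null ? "<unassigned>" : Bytes.toString(server)));
        }
      } finally {
        scanner.close();
      }
    } finally {
      meta.close();
    }
  }
}

The ROOT location on the ZK side can be inspected with zkCli; assuming default settings it should
be under /hbase/root-region-server, so between the two it should be possible to tell which copy of
the old address the master is tripping over after the restart.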

My point is that this is not an LZO issue.


> If regions assignment fails, clients will be directed to stale data from .META.
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-3660
>                 URL: https://issues.apache.org/jira/browse/HBASE-3660
>             Project: HBase
>          Issue Type: Bug
>          Components: master, regionserver
>    Affects Versions: 0.90.1
>            Reporter: Cosmin Lehene
>             Fix For: 0.90.2
>
>
> I've noticed this when the IP on my machine changed (it's even easier to detect when
LZO doesn't work).
> Master loads .META. successfully and then starts assigning regions.
> However LZO doesn't work so HRegionServer can't open the regions. 
> A client attempts to get data from a table, so it reads the location from .META. but goes
to a totally different server (the old value in .META.).
> This could happen without the LZO story too. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
