hbase-dev mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-644) DroppedSnapshotException but RegionServer doesn't restart
Date Sun, 25 May 2008 20:39:55 GMT

    [ https://issues.apache.org/jira/browse/HBASE-644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12599717#action_12599717 ]

stack commented on HBASE-644:
-----------------------------

Here is an illustration of the hang, taken from Daniel's logs up on EC2:

{code}2008-05-19 15:26:17,985 INFO org.apache.hadoop.hbase.HRegion: Blocking updates for 'IPC
Server handler 8 on 60020': Memcache size 64.0m is >= than blocking 64.0m size
2008-05-19 15:26:34,556 INFO org.apache.hadoop.dfs.DFSClient: Exception in createBlockOutputStream
java.net.SocketTimeoutException: Read timed out
2008-05-19 15:26:34,558 INFO org.apache.hadoop.dfs.DFSClient: Abandoning block blk_4812312891610152736
2008-05-19 15:26:34,562 INFO org.apache.hadoop.dfs.DFSClient: Waiting to find target node:
10.254.26.31:50010
2008-05-19 15:26:40,567 WARN org.apache.hadoop.dfs.DFSClient: DataStreamer Exception: java.io.IOException:
Unable to create new block.
2008-05-19 15:26:40,567 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_4812312891610152736
bad datanode[0]
2008-05-19 15:26:40,567 FATAL org.apache.hadoop.hbase.HRegionServer: Replay of hlog required.
Forcing server restart
org.apache.hadoop.hbase.DroppedSnapshotException: Could not get block locations. Aborting...
    at org.apache.hadoop.hbase.HRegion.internalFlushcache(HRegion.java:1115)
    at org.apache.hadoop.hbase.HRegion.flushcache(HRegion.java:1020)
    at org.apache.hadoop.hbase.HRegionServer$Flusher.flushRegion(HRegionServer.java:447)
    at org.apache.hadoop.hbase.HRegionServer$Flusher.run(HRegionServer.java:390)
2008-05-19 15:26:40,573 INFO org.apache.hadoop.ipc.Server: Stopping server on 60020
2008-05-19 15:26:40,573 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on
60020
2008-05-19 15:26:40,574 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2008-05-19 15:26:40,574 INFO org.apache.hadoop.hbase.HRegionServer: regionserver/0.0.0.0:60020.cacheFlusher
exiting
..
{code}
Above, we see the decision in the Flusher to restart, but notice that there is no "Shutting down
HRegionServer: file system not available" message from checkFilesystem.  The only thing that
follows is the shutdown of the RPC.
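
The failure mode above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the HRegionServer code and not the attached 644.patch: `FlushAbortSketch`, `ServerState`, `FileSystemProbe` and both handler methods are hypothetical names. The first handler sets the abort flag only when the filesystem check fails, which matches what the log shows; the second treats a dropped snapshot as fatal regardless of what the check reports, which is one possible way to avoid the half-stopped state.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch; none of these names come from the HBase code base. */
public class FlushAbortSketch {

    /** Stand-in for a filesystem health check such as checkFilesystem. */
    interface FileSystemProbe {
        boolean isAvailable();
    }

    /** Minimal model of the flags a region server might track. */
    static class ServerState {
        final AtomicBoolean rpcStopped = new AtomicBoolean(false);
        final AtomicBoolean abortRequested = new AtomicBoolean(false);
    }

    /**
     * Pattern described in the issue: the RPC is always stopped, but the
     * abort flag depends on the filesystem check failing.
     */
    static void handleFlushFailureFsDependent(ServerState state, FileSystemProbe fs) {
        state.rpcStopped.set(true);
        if (!fs.isAvailable()) {
            state.abortRequested.set(true);
        }
        // If the probe reports healthy, the server is left half-stopped.
    }

    /**
     * Assumed alternative: a dropped snapshot always requests an abort.
     * The probe result is ignored on purpose.
     */
    static void handleFlushFailureUnconditional(ServerState state, FileSystemProbe fs) {
        state.rpcStopped.set(true);
        state.abortRequested.set(true);
    }

    public static void main(String[] args) {
        FileSystemProbe healthy = () -> true; // filesystem "somehow returned healthy"

        ServerState first = new ServerState();
        handleFlushFailureFsDependent(first, healthy);
        System.out.println("fs-dependent: rpcStopped=" + first.rpcStopped.get()
                + " abortRequested=" + first.abortRequested.get());

        ServerState second = new ServerState();
        handleFlushFailureUnconditional(second, healthy);
        System.out.println("unconditional: rpcStopped=" + second.rpcStopped.get()
                + " abortRequested=" + second.abortRequested.get());
    }
}
```

With a probe that reports healthy, the first handler leaves the server with its RPC stopped but no abort requested, while the second requests the abort.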

The regionserver stays like this until there is human intervention 20 minutes later.  We even try
a compaction while we're in this state.

{code}
2008-05-19 15:28:11,609 ERROR org.apache.hadoop.hbase.HRegionServer: Compaction failed for
region categories,,1210801647493
java.io.IOException: Could not get block locations. Aborting...
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:1832)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1100(DFSClient.java:1487)
    at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1579)
2008-05-19 15:28:11,614 INFO org.apache.hadoop.hbase.HRegion: checking compaction on region
categories,2864153,1211005494348
2008-05-19 15:28:11,619 DEBUG org.apache.hadoop.hbase.HStore: started compaction of 26 files
[1060231198/parent_categories/598305439430938634, 1060231198/parent_categories/4425107540345282973,
1060231198/parent_categories/1614092432458686638, 1060231198/parent_categories/2207075305093635923,
1060231198/parent_categories/4896580642418276150, 1060231198/parent_categories/3771164238635045810,
1060231198/parent_categories/8367575636975706877, 1060231198/parent_categories/563936161407142553,
1060231198/parent_categories/7201129332580659551, 1060231198/parent_categories/6504265075252363889,
1060231198/parent_categories/6507030234222769907, 1060231198/parent_categories/681881762013309210,
1060231198/parent_categories/178568648155148325, 1060231198/parent_categories/4045949900443465272,
1060231198/parent_categories/3620331467236868822, 1060231198/parent_categories/1171065084128878397,
1060231198/parent_categories/5412012975309666204, 1060231198/parent_categories/3257536290236063841,
1060231198/parent_categories/8332496624429761938, 1060231198/parent_categories/4487341413361548197,
1060231198/parent_categories/3927839157823465388, 1060231198/parent_categories/3727860104887831904,
1060231198/parent_categories/4427275327985154943, 1060231198/parent_categories/6572980649594273108,
1060231198/parent_categories/3726725692879501830, 1060231198/parent_categories/7261722476991845495]
into /hbase/categories/compaction.dir/1060231198/parent_categories/mapfiles/5850576708252468402
2008-05-19 15:47:30,471 INFO org.apache.hadoop.hbase.HRegionServer: Got quiesce server message
..
{code}


> DroppedSnapshotException but RegionServer doesn't restart
> ---------------------------------------------------------
>
>                 Key: HBASE-644
>                 URL: https://issues.apache.org/jira/browse/HBASE-644
>             Project: Hadoop HBase
>          Issue Type: Bug
>            Reporter: stack
>            Priority: Blocker
>             Fix For: 0.1.3, 0.2.0
>
>         Attachments: 644.patch
>
>
> RegionServer was carrying -ROOT- and having trouble writing to HDFS.  RegionServer judged
that a flush failed and reported a DroppedSnapshotException.  Usually, the filesystem check
would fail and set all the abort flags, but in this case the filesystem somehow returned healthy
and the flags were not set.  The code path shut down the RPC only, and then we exited
the Flusher.  All the rest of the RegionServer stayed up and kept reporting to the master.  The
master thought it was alive and kept trying to scan the unreachable -ROOT-.  Cluster was hosed
until manual intervention 20 minutes later.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

