hbase-dev mailing list archives

From "Bryan Duxbury (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-497) RegionServer needs to recover if datanode goes down
Date Sat, 08 Mar 2008 02:03:46 GMT

    [ https://issues.apache.org/jira/browse/HBASE-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12576469#action_12576469 ]

Bryan Duxbury commented on HBASE-497:

At least in 0.1, where HDFS gives us no way to append to an existing file, an error while
appending to a log leaves no real way to recover the lost log data. This is because an HDFS
file's contents aren't visible to readers until the file is closed. (See HADOOP-1700)

Our options are:
 * Bail the regionserver. We've hit an exception we should never really get, and it's bad.
   Let a restart sort it out.
 * Bail the regionserver, but try to flush the caches first. This has the advantage of
   saving the data already written to the caches, if possible. The flow needed to make it
   happen might end up convoluted.
 * Open a new log like nothing ever happened. We'll have lost the updates since the last log
   roll, but there's nothing we can do to recover them anyway.
 * Change logging to write to a local file as well as the HDFS file. Then, if there's an
   exception at any point while writing to the HDFS log, we can copy the local version of the
   log up to HDFS and keep appending. This gives us some resilience to datanode failures, but
   doesn't really make our logs any more useful in the case of dying machines or network
   partitions. It's also a lot of new functionality, which doesn't exactly fit with the goals
   of 0.1 (bugfixes only).
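
To make the last option concrete, here is a minimal sketch of what "log locally as well, then
replay on failure" could look like. It is an illustration under assumptions, not HBase code:
MirroredLog, RemoteLogOpener and FlakyStream are hypothetical names, the "remote" side is any
OutputStream rather than a real HDFS stream, and edits are plain newline-terminated strings.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical supplier of a fresh remote log stream; not a real Hadoop API. */
interface RemoteLogOpener {
    OutputStream open() throws IOException;
}

/** Writes each edit to a local file first, then to the remote stream. */
class MirroredLog {
    private final Path localCopy;
    private final RemoteLogOpener opener;
    private OutputStream remote;

    MirroredLog(Path localCopy, RemoteLogOpener opener) throws IOException {
        this.localCopy = localCopy;
        this.opener = opener;
        this.remote = opener.open();
    }

    synchronized void append(String edit) throws IOException {
        byte[] record = (edit + "\n").getBytes(StandardCharsets.UTF_8);
        // The local write happens first so the record survives a remote failure.
        Files.write(localCopy, record, StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        try {
            remote.write(record);
        } catch (IOException e) {
            // The remote stream is dead: open a replacement and replay the whole
            // local copy, which already contains the record that just failed.
            remote = opener.open();
            remote.write(Files.readAllBytes(localCopy));
        }
    }
}

/** Test double: accepts a fixed number of writes, then fails like a closed stream. */
class FlakyStream extends OutputStream {
    private int remaining;

    FlakyStream(int okWrites) {
        this.remaining = okWrites;
    }

    @Override
    public void write(int b) throws IOException {
        write(new byte[] {(byte) b}, 0, 1);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        if (remaining-- <= 0) {
            throw new IOException("Stream closed.");
        }
    }
}

class MirroredLogDemo {
    public static void main(String[] args) throws IOException {
        Path local = Files.createTempFile("hlog-mirror", ".log");
        ByteArrayOutputStream replacement = new ByteArrayOutputStream();
        // First remote stream fails on its second write; the second one is healthy.
        OutputStream[] streams = {new FlakyStream(1), replacement};
        int[] next = {0};
        MirroredLog log = new MirroredLog(local, () -> streams[next[0]++]);
        for (String edit : new String[] {"a", "b", "c"}) {
            log.append(edit);
        }
        System.out.print(new String(replacement.toByteArray(), StandardCharsets.UTF_8));
        Files.delete(local);
    }
}
```

As noted in the option itself, this only helps when the regionserver's own machine stays up:
the local copy lives on the same disk as the process that wrote it.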

Of these options, I think the best one is to just open a new log. This will keep our regionserver
online and let us carry on with the minimum of difficulty. Does this seem like enough of a
fix to satisfy the 0.1 release block?
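
For illustration, the "open a new log" behaviour might be sketched roughly as below. This is
not the actual HLog code: LogWriter, LogFactory and RollingLog are hypothetical stand-ins, and
the sketch assumes that one retry against a freshly opened log is acceptable and that a second
failure should propagate to the caller.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an HDFS-backed log writer; not the real HLog API. */
interface LogWriter {
    void append(String edit) throws IOException;
}

/** Hypothetical factory that opens a fresh log file each time it is called. */
interface LogFactory {
    LogWriter open(int sequence) throws IOException;
}

/** Rolls to a new log when an append fails, abandoning whatever the old log held. */
class RollingLog {
    private final LogFactory factory;
    private LogWriter current;
    private int sequence = 0;
    private int rollsAfterFailure = 0;

    RollingLog(LogFactory factory) throws IOException {
        this.factory = factory;
        this.current = factory.open(sequence);
    }

    synchronized void append(String edit) throws IOException {
        try {
            current.append(edit);
        } catch (IOException e) {
            // The old stream is unusable, so open a new log and retry once.
            // Edits in the old, unclosed log are given up as lost.
            sequence++;
            rollsAfterFailure++;
            current = factory.open(sequence);
            current.append(edit);
        }
    }

    synchronized int rollsAfterFailure() {
        return rollsAfterFailure;
    }
}

class RollingLogDemo {
    public static void main(String[] args) throws IOException {
        List<String> written = new ArrayList<>();
        // Simulated failure: the first log accepts one edit, then its stream "closes".
        LogFactory factory = seq -> edit -> {
            if (seq == 0 && !written.isEmpty()) {
                throw new IOException("Stream closed.");
            }
            written.add("log" + seq + ":" + edit);
        };
        RollingLog log = new RollingLog(factory);
        for (String edit : new String[] {"a", "b", "c"}) {
            log.append(edit);
        }
        System.out.println(written + " rolls=" + log.rollsAfterFailure());
    }
}
```

The point of the sketch is that the regionserver keeps serving updates after the failure; it
does nothing to bring back the edits that were in the abandoned log.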

> RegionServer needs to recover if datanode goes down
> ---------------------------------------------------
>                 Key: HBASE-497
>                 URL: https://issues.apache.org/jira/browse/HBASE-497
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.16.0
>            Reporter: Michael Bieniosek
>            Priority: Blocker
>             Fix For: 0.1.0, 0.2.0
> If I take down a datanode, the regionserver will repeatedly return this error:
> java.io.IOException: Stream closed.
>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.isClosed(DFSClient.java:1875)
>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.writeChunk(DFSClient.java:2096)
>         at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:141)
>         at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:124)
>         at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:112)
>         at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:86)
>         at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:41)
>         at java.io.DataOutputStream.write(Unknown Source)
>         at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:977)
>         at org.apache.hadoop.hbase.HLog.append(HLog.java:377)
>         at org.apache.hadoop.hbase.HRegion.update(HRegion.java:1455)
>         at org.apache.hadoop.hbase.HRegion.batchUpdate(HRegion.java:1259)
>         at org.apache.hadoop.hbase.HRegionServer.batchUpdate(HRegionServer.java:1433)
>         at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>         at java.lang.reflect.Method.invoke(Unknown Source)
>         at org.apache.hadoop.hbase.ipc.HbaseRPC$Server.call(HbaseRPC.java:413)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:910)
> It appears that hbase/dfsclient does not attempt to reopen the stream.

