hbase-issues mailing list archives

From "stack (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-3604) Two region servers think that they own the same region: data loss
Date Fri, 04 Mar 2011 18:40:37 GMT

    [ https://issues.apache.org/jira/browse/HBASE-3604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002772#comment-13002772 ]

stack commented on HBASE-3604:

@Dhruba It'd be interesting to look at server A's logs if you ever get to it.  Do you see
any logging of its expired zk session?  Should we add verification of the zk connection before
doing any fs file move?
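
A minimal sketch of what such a check could look like, not taken from the HBase code. Every name here (GuardedFileMove, SessionExpiredException, the sessionAlive supplier) is hypothetical; the supplier stands in for whatever real ZooKeeper session check the regionserver would use.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch; none of these names come from HBase. */
public class GuardedFileMove {

    /** Thrown when the caller cannot confirm that its session is still live. */
    public static class SessionExpiredException extends RuntimeException {
        public SessionExpiredException(String message) {
            super(message);
        }
    }

    // Stand-in for a real ZooKeeper session check (an assumption, not an HBase API).
    private final BooleanSupplier sessionAlive;

    public GuardedFileMove(BooleanSupplier sessionAlive) {
        this.sessionAlive = sessionAlive;
    }

    /** Runs the file operation only if the session check passes first. */
    public void run(Runnable fileOperation) {
        if (!sessionAlive.getAsBoolean()) {
            throw new SessionExpiredException(
                "zk session not verified; refusing fs operation");
        }
        fileOperation.run();
    }
}
```

A check like this only narrows the window: the session can still expire between the check and the file operation, so it would not by itself stop a stale regionserver from deleting or moving a file.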

> Two region servers think that they own the same region: data loss
> -----------------------------------------------------------------
>                 Key: HBASE-3604
>                 URL: https://issues.apache.org/jira/browse/HBASE-3604
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.0
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
> I observed this on a 100-node cluster that is constantly doing about 500K ops/second.
> The region server on machine A was servicing IOs for a particular region. Then the machine
> went into a bad state where it was ping-able but not ssh-able. The master detected that there
> was a problem with machine A and reassigned the region to machine B. The regionserver on machine
> B opened the region and opened all the required HFiles for this region. After two hours, the
> NameNode received a delete request for one of the HFiles from machine A and happily renamed
> the file to HDFS-Trash. After another 3 hours or so, the regionserver on machine B tried to
> read contents from that HFile but failed because the file had been renamed earlier. The region
> server on B is now stuck, and data loss is possible.
> The problem stems from the fact that although the master-and-ZK reassigned the region,
> the old regionserver was possibly not dead.

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

