hbase-issues mailing list archives

From "Dimitri Goldin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-8502) Eternally stuck Region after split
Date Fri, 10 May 2013 14:23:17 GMT

    [ https://issues.apache.org/jira/browse/HBASE-8502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13654504#comment-13654504
] 

Dimitri Goldin commented on HBASE-8502:
---------------------------------------

Yes, during that time some systems were overloaded, which caused stability issues. So the split
probably failed because of a timeout.

I checked the DN and NN logs, but they do not contain the period in question (they start at
2013-03-19 11:56) and are pretty useless. They mostly contain the already known FileNotFoundException
and some renaming issues quoted below.

The contents of the 79c619508659018ff3ef0887611eb8f7 daughter suggest that either a split
succeeded despite a previously failed and improperly rolled-back attempt, or the failed split
simply did not clean up after itself. So I do believe that HBase deleted the parent region itself.

My suspicion is that either there is a flaw in the rollback logic under unusual circumstances,
or that retried splits do not check for left-over reference files and the like. It is strange,
though, that one of the successfully copied HFiles still has its ref-file.

I'm sorry that I cannot provide full logs for the period, as this has been lurking silently
for quite a while.
Any ideas as to why the daughter region was able to stay offline for so long without causing any
problems? I think this might almost be a separate issue.

On the mailing list it seemed that [~kevin.odell] has encountered this issue before; maybe
he can help us work around the incomplete information from the logs.

How can we go about this with what we have? Any ideas how to reproduce and verify the behaviour?

{quote}
hadoop-cmf-hdfs1-NAMENODE-mia-node08.miacluster.priv.log.out.2:2013-03-19 13:39:52,937 mia-node08.miacluster.priv
WARN org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename
/hbase/documents/e6227aaa6f6e03188372ec534bf7e150/d/0707b1ec4c6b41cf9174e0d2a1785fe9.5b9c16898a371de58f31f0bdf86b1f8b
to /hbase/documents/79c619508659018ff3ef0887611eb8f7/d/0707b1ec4c6b41cf9174e0d2a1785fe9.5b9c16898a371de58f31f0bdf86b1f8b
because destination exists
hadoop-cmf-hdfs1-NAMENODE-mia-node08.miacluster.priv.log.out.2:2013-03-19 13:39:52,938 mia-node08.miacluster.priv
WARN org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename
/hbase/documents/e6227aaa6f6e03188372ec534bf7e150/d/47511faae81b4452afd3ca206e28346f.5b9c16898a371de58f31f0bdf86b1f8b
to /hbase/documents/79c619508659018ff3ef0887611eb8f7/d/47511faae81b4452afd3ca206e28346f.5b9c16898a371de58f31f0bdf86b1f8b
because destination exists
hadoop-cmf-hdfs1-NAMENODE-mia-node08.miacluster.priv.log.out.2:2013-03-19 13:39:52,938 mia-node08.miacluster.priv
WARN org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename
/hbase/documents/e6227aaa6f6e03188372ec534bf7e150/d/4f01ecd052ce464d81e79a62ea227d6b.5b9c16898a371de58f31f0bdf86b1f8b
to /hbase/documents/79c619508659018ff3ef0887611eb8f7/d/4f01ecd052ce464d81e79a62ea227d6b.5b9c16898a371de58f31f0bdf86b1f8b
because destination exists
{quote}
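The rename failures quoted above can be pulled out of a NameNode log mechanically, which makes it easier to see which region directories were involved. The following is only a minimal sketch: the regular expression is derived from the three quoted log entries, the /hbase/<table>/<region>/<family>/<file> layout is inferred from the paths shown here, and the helper names (failed_renames, region_of) are made up for illustration.

```python
import re

# Pattern derived from the quoted NameNode warnings. \s+ also matches line
# breaks, so entries that were wrapped across several lines still match.
RENAME_FAILED = re.compile(
    r"failed to rename\s+(?P<src>/\S+)\s+to\s+(?P<dst>/\S+)\s+because destination exists"
)


def failed_renames(log_text):
    """Return (source, destination) pairs for every failed rename in the log text."""
    return [(m.group("src"), m.group("dst")) for m in RENAME_FAILED.finditer(log_text)]


def region_of(path):
    """Extract the encoded region name.

    Assumes the layout /hbase/<table>/<region>/<family>/<file>, as seen in the
    paths quoted in this comment.
    """
    return path.strip("/").split("/")[2]


# Two of the quoted entries, with the log-file prefix shortened for readability.
sample_log = """
2013-03-19 13:39:52,937 WARN org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename
/hbase/documents/e6227aaa6f6e03188372ec534bf7e150/d/0707b1ec4c6b41cf9174e0d2a1785fe9.5b9c16898a371de58f31f0bdf86b1f8b
to /hbase/documents/79c619508659018ff3ef0887611eb8f7/d/0707b1ec4c6b41cf9174e0d2a1785fe9.5b9c16898a371de58f31f0bdf86b1f8b
because destination exists
2013-03-19 13:39:52,938 WARN org.apache.hadoop.hdfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename
/hbase/documents/e6227aaa6f6e03188372ec534bf7e150/d/47511faae81b4452afd3ca206e28346f.5b9c16898a371de58f31f0bdf86b1f8b
to /hbase/documents/79c619508659018ff3ef0887611eb8f7/d/47511faae81b4452afd3ca206e28346f.5b9c16898a371de58f31f0bdf86b1f8b
because destination exists
"""

pairs = failed_renames(sample_log)
for src, dst in pairs:
    print(region_of(src), "->", region_of(dst), src.rsplit("/", 1)[1])
```

Run against the full NameNode log, this would list every source and destination region for which a rename was rejected because the target file already existed.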


                
> Eternally stuck Region after split
> ----------------------------------
>
>                 Key: HBASE-8502
>                 URL: https://issues.apache.org/jira/browse/HBASE-8502
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.92.1
>            Reporter: Dimitri Goldin
>            Priority: Critical
>         Attachments: hbase_lost_parent.txt, stuck_region_exception.txt
>
>
> Exact HBase version: 0.92.1-cdh4.1.2
> A couple of days ago I encountered a RIT problem with a single region.
> After an hbck run it started trying to assign a region which has been 
> bouncing between OFFLINE/PENDING_OPEN/OPENING for two days afterwards.
> This was due to a split gone wrong in some way, which led to several 
> reference files being left in the region directory despite the two relevant HFiles being
> copied successfully to the daughter.
> I will try to give as many details as possible, but unfortunately I was
> unable to find any information about the split itself.
> Short thread about this issue on the users-ML: http://mail-archives.apache.org/mod_mbox/hbase-user/201305.mbox/%3C5182758B.1060306@neofonie.de%3E
> ===
> Parent region: 5b9c16898a371de58f31f0bdf86b1f8b
> Daughter region in question: 79c619508659018ff3ef0887611eb8f7
> Rough sequence from the logs seems to be the following:
> ===
> * Received request to open region:
> documents,7128586022887322720,1363696791400.79c619508659018ff3ef0887611eb8f7.
> * Setting up tabledescriptor config now ...
> * Opening of region {NAME =>
> 'documents,7128586022887322720,1363696791400.79c619508659018ff3ef0887611eb8f7.',
>      STARTKEY => '7128586022887322720',
>      ENDKEY => '7130716361635801616',
>      ENCODED => 79c619508659018ff3ef0887611eb8f7,} failed, marking as 
> FAILED_OPEN in ZK
> * File does not exist: 
> /hbase/documents/5b9c16898a371de58f31f0bdf86b1f8b/d/0707b1ec4c6b41cf9174e0d2a1785fe9

> [...]
> ===
> What happened was that somehow (and that's the question here) the daughter's
> region folder contained some left-over reference files, which were causing the 
> RegionServer to look up the parent region, which had already been deleted.
> original contents of /hbase/documents/79c619508659018ff3ef0887611eb8f7/d:
> ==
> 0707b1ec4c6b41cf9174e0d2a1785fe9.5b9c16898a371de58f31f0bdf86b1f8b
> 47511faae81b4452afd3ca206e28346f.5b9c16898a371de58f31f0bdf86b1f8b
> 4f01ecd052ce464d81e79a62ea227d6b
> 4f01ecd052ce464d81e79a62ea227d6b.5b9c16898a371de58f31f0bdf86b1f8b
> eb7dbb09701d4353be24ca82481c4a7e
> == 
> I attached the full FileNotFound Exception.
> Please let me know if I can provide more information or help otherwise.
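
The directory listing in the quoted description can be checked for this pattern with a small script. This is only a sketch under stated assumptions: it treats a name of the form <hfile>.<encoded parent region> as a reference file (which matches the listing and the FileNotFoundException path above), the set of existing regions is supplied by hand, and the function names are hypothetical. It does not talk to HDFS or HBase.

```python
import re

# Assumption: a reference file is named "<hfile>.<encoded parent region>",
# both parts being 32 hex characters; a plain HFile has no suffix.
REF_NAME = re.compile(r"^(?P<hfile>[0-9a-f]{32})\.(?P<parent>[0-9a-f]{32})$")


def classify(names):
    """Split a store directory listing into plain HFiles and reference files."""
    hfiles, refs = set(), []
    for name in names:
        match = REF_NAME.match(name)
        if match:
            refs.append((match.group("hfile"), match.group("parent")))
        else:
            hfiles.add(name)
    return hfiles, refs


def suspicious_refs(names, existing_regions):
    """Report, per reference file, whether its parent region is gone and
    whether a plain file with the same HFile name sits next to it."""
    hfiles, refs = classify(names)
    return [
        {
            "ref": f"{hfile}.{parent}",
            "parent_missing": parent not in existing_regions,
            "plain_file_with_same_name": hfile in hfiles,
        }
        for hfile, parent in refs
    ]


# Contents of /hbase/documents/79c619508659018ff3ef0887611eb8f7/d as quoted above.
listing = [
    "0707b1ec4c6b41cf9174e0d2a1785fe9.5b9c16898a371de58f31f0bdf86b1f8b",
    "47511faae81b4452afd3ca206e28346f.5b9c16898a371de58f31f0bdf86b1f8b",
    "4f01ecd052ce464d81e79a62ea227d6b",
    "4f01ecd052ce464d81e79a62ea227d6b.5b9c16898a371de58f31f0bdf86b1f8b",
    "eb7dbb09701d4353be24ca82481c4a7e",
]

# Assumption for this example: only the daughter directory still exists,
# i.e. the parent 5b9c16898a371de58f31f0bdf86b1f8b has been deleted.
existing = {"79c619508659018ff3ef0887611eb8f7"}

report = suspicious_refs(listing, existing)
for entry in report:
    print(entry)
```

In a real check the listing and the set of existing regions would come from the file system (for example from the output of hadoop fs -ls) rather than from hard-coded values.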

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
