hbase-issues mailing list archives

From "Aaron Kimball (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-3872) Hole in split transaction rollback; edits to .META. need to be rolled back even if it seems like they didn't make it
Date Thu, 23 Jun 2011 05:17:47 GMT

    [ https://issues.apache.org/jira/browse/HBASE-3872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053659#comment-13053659 ]

Aaron Kimball commented on HBASE-3872:
--------------------------------------

A further observation: this seems to have occurred when splitting multiple regions within
the same table (during a day of large bulk loads). The logs show the parent region being offlined,
then both daughter regions being instantiated. The following sequence of log messages appeared
both times:

{code}
2011-06-21 21:51:17,594 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Instantiated (redacted-a-daughter).
2011-06-21 21:51:17,630 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Instantiated (redacted-b-daughter).
2011-06-21 21:52:05,412 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
2011-06-21 21:52:17,666 INFO org.apache.hadoop.hbase.regionserver.CompactSplitThread: Running rollback of failed split of (redacted-parent-region); Call to (redacted-server-address):60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=(redacted-server-ip):54054 remote=(redacted-server-address):60020]
{code}

I find it noteworthy that, in both cases of missing regions in my table that I am aware of,
the "Hlog roll period elapsed" message occurred between the "B" daughter instantiation and
the socket timeout.
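
As a quick arithmetic check on the excerpt above, the gap between the "B" daughter
instantiation and the rollback message can be computed from the logged timestamps. This is
only a sketch in Python; the variable names are illustrative and nothing here comes from
HBase itself. The result is about 60.04 seconds, which is consistent with the 60000 ms
socket timeout reported in the rollback message, although the logs do not show exactly when
the timed-out call was issued.

```python
from datetime import datetime, timedelta

# Timestamps copied from the log excerpt; "%f" reads the three
# millisecond digits after the comma as a fraction of a second.
FMT = "%Y-%m-%d %H:%M:%S,%f"
daughter_b = datetime.strptime("2011-06-21 21:51:17,630", FMT)
log_roll = datetime.strptime("2011-06-21 21:52:05,412", FMT)
rollback = datetime.strptime("2011-06-21 21:52:17,666", FMT)

ONE_MS = timedelta(milliseconds=1)

# Integer millisecond gaps, measured from the second daughter's instantiation.
gap_ms = (rollback - daughter_b) // ONE_MS
roll_offset_ms = (log_roll - daughter_b) // ONE_MS

print(f"daughter B -> rollback: {gap_ms} ms")
print(f"daughter B -> log roll: {roll_offset_ms} ms")
```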


> Hole in split transaction rollback; edits to .META. need to be rolled back even if it seems like they didn't make it
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-3872
>                 URL: https://issues.apache.org/jira/browse/HBASE-3872
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: stack
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.90.4
>
>         Attachments: 3872.txt
>
>
> Saw this interesting one on a cluster of ours.  The cluster was configured with too few
> handlers, so we saw a lot of the phenomenon where actions were queued but, by the time they
> got into the server and tried to respond to the client, the client had disconnected because
> of the 60-second timeout.  Well, the meta edits for a split were queued at the regionserver
> carrying .META. and by the time it went to write back, the client had gone (the first insert
> of parent offline with daughter regions added as info:splitA and info:splitB).  The client
> presumed the edits failed and 'successfully' rolled back the transaction (failing to undo
> .META. edits, thinking they didn't go through).
> A few minutes later the .META. scanner on master runs.  It sees 'no references' in daughters
> -- the daughters had been cleaned up as part of the split transaction rollback -- so it
> thinks it's safe to delete the parent.
> Two things:
> + Tighten up the check in master... need to check that the daughter region at least exists
> and possibly that the daughter region has an entry in .META.
> + Depending on the edit that fails, schedule rollback edits even though it will seem like
> they didn't go through.
> This is a pretty critical one.
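
The failure mode and the two proposed fixes can be illustrated with a small in-memory model.
This is a hedged sketch in Python, not HBase code: `ToyMeta`, `split_with_rollback`,
`naive_may_delete_parent`, and `tight_may_delete_parent` are hypothetical names, and the
"offline" column is a stand-in for however the parent is actually marked offline. The model
assumes the .META. edit is applied on the server even though the client sees a timeout.

```python
class RpcTimeout(Exception):
    """Client gave up waiting; the server may still have applied the edit."""


class ToyMeta:
    """In-memory stand-in for the .META. table (hypothetical, not the HBase API)."""

    def __init__(self):
        self.rows = {}  # region name -> dict of column values

    def put_slow(self, row, columns):
        # Models the reported hole: the edit lands, but the reply arrives
        # after the client's timeout, so the client only sees an error.
        self.rows.setdefault(row, {}).update(columns)
        raise RpcTimeout(row)

    def delete_columns(self, row, names):
        for name in names:
            self.rows.get(row, {}).pop(name, None)


def split_with_rollback(meta, parent, undo_meta_on_timeout):
    """Attempt the parent-offline edit; roll back when the call times out."""
    edit = {"offline": True,
            "info:splitA": parent + "-a",
            "info:splitB": parent + "-b"}
    try:
        meta.put_slow(parent, edit)
        return "split"
    except RpcTimeout:
        # The outcome is unknown, so the second proposed fix undoes the
        # .META. edit even though it looks like it never went through.
        if undo_meta_on_timeout:
            meta.delete_columns(parent, list(edit))
        return "rolled back"


def daughters_of(row):
    return [row.get("info:splitA"), row.get("info:splitB")]


def naive_may_delete_parent(row, daughter_refs):
    """Old-style check: an offlined parent goes once no daughter references it.

    daughter_refs maps each existing daughter to True if it still holds
    references to the parent. A missing daughter also has "no references".
    """
    if not row.get("offline"):
        return False
    return not any(daughter_refs.get(d, False) for d in daughters_of(row))


def tight_may_delete_parent(row, daughter_refs):
    """First proposed fix: additionally require that both daughters exist."""
    if not row.get("offline"):
        return False
    daughters = daughters_of(row)
    return (all(d in daughter_refs for d in daughters)
            and not any(daughter_refs[d] for d in daughters))
```

With `undo_meta_on_timeout=False`, the parent row stays marked offline after the rollback, and
the naive check approves deleting the parent once the daughters are gone, while the tightened
check refuses. With `undo_meta_on_timeout=True`, the row is cleared and neither check approves
deletion. The sketch leaves out the further suggestion of checking that each daughter has its
own entry in .META.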

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira