hbase-dev mailing list archives

From "Jim Kellerman (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-505) Region assignments should never time out so long as the region server reports that it is processing the open request
Date Sun, 30 Mar 2008 23:16:24 GMT

    [ https://issues.apache.org/jira/browse/HBASE-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12583503#action_12583503 ]

Jim Kellerman commented on HBASE-505:
-------------------------------------

Reviewed the patch: +1. I especially like the idea of using a Progressable to indicate that
something is really happening.
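
To illustrate the Progressable idea, here is a minimal sketch of how a long-running region open might report that it is still making progress. It is not taken from 505.patch. The local `Progressable` interface only mirrors the single-method shape of Hadoop's interface so the example is self-contained, and `OpenProgress`, `RegionOpener`, `replayLog`, and the 1000-edit reporting interval are hypothetical names and values chosen for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Local stand-in with the same shape as Hadoop's Progressable (one progress() method).
interface Progressable {
    void progress();
}

// Hypothetical helper: records that the open made progress since the last heartbeat.
class OpenProgress implements Progressable {
    private final AtomicBoolean progressed = new AtomicBoolean(false);

    @Override
    public void progress() {
        progressed.set(true);
    }

    // Returns true if progress was reported since the previous call, then resets the flag.
    // A heartbeat thread could use this to decide whether to send MSG_PROCESS_OPEN.
    boolean pollAndReset() {
        return progressed.getAndSet(false);
    }
}

// Hypothetical stand-in for the slow part of opening a region (redo log replay).
class RegionOpener {
    static void replayLog(int edits, Progressable reporter) {
        for (int i = 0; i < edits; i++) {
            // Applying edit i is omitted in this sketch.
            if (i % 1000 == 0) {
                reporter.progress(); // assumed reporting interval
            }
        }
    }
}
```

In this sketch the open code never talks to the master directly. It only calls `progress()`, and the heartbeat path decides what to send.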

> Region assignments should never time out so long as the region server reports that it is processing the open request
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-505
>                 URL: https://issues.apache.org/jira/browse/HBASE-505
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.2.0, 0.1.0
>            Reporter: Jim Kellerman
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.2.0, 0.1.1
>
>         Attachments: 505.patch
>
>
> Currently, when the master assigns a region to a region server, it extends the reassignment
> timeout when the region server reports that it is processing the open. This happens only once,
> so if the region takes a long time to come online, either because of a large set of
> transactions in the redo log or because the initial compaction takes a long time, the master
> assigns the region to another server when the reassignment timeout expires.
>
> Assigning a region to multiple region servers can easily corrupt the region. For example:
>
> Region server 1 is processing the redo log and creating a new MapFile. This takes longer than
> one timeout interval, so the master assigns the region to region server 2. Region server 2
> starts processing the redo log, creating essentially the same MapFile as region server 1 but
> under a different name.
>
> Region server 2 can fail to open the region if region server 1 deletes the old log file, or
> if region server 2 tries to open the new MapFile that region server 1 is still creating.
>
> Region server 1 can fail to open the region if it tries to open the MapFile that region
> server 2 is creating.
>
> Often region server 1 eventually succeeds and reports to the master that it has finished
> opening the region, but the master tells it to close that region because the region has been
> assigned to another server. Region server 2 often fails to open the region, either because the
> old log file has been deleted or because it fails to process the new MapFile created by
> region server 1.
>
> Proposed solution:
>
> During the open process, the region server should send a MSG_PROCESS_OPEN with each heartbeat
> until the region is opened, at which point it sends MSG_REGION_OPEN. The master will extend the
> reassignment timeout with each MSG_PROCESS_OPEN it receives and will not assign the region to
> another server so long as it continues to receive heartbeat messages from the region server
> processing the open.
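
The proposed master-side behaviour can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in 505.patch: `AssignmentTracker` and its methods are hypothetical, time is passed in explicitly as milliseconds so the logic is easy to follow, and only the message names MSG_PROCESS_OPEN and MSG_REGION_OPEN come from the issue text.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the master's bookkeeping for regions that are being opened.
class AssignmentTracker {
    // One pending assignment: which server is opening the region and when it times out.
    private static final class Pending {
        final String server;
        long deadline;

        Pending(String server, long deadline) {
            this.server = server;
            this.deadline = deadline;
        }
    }

    private final long timeoutMillis;
    private final Map<String, Pending> pending = new HashMap<>();

    AssignmentTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // The master assigns a region to a server and starts the reassignment timeout.
    void assign(String region, String server, long now) {
        pending.put(region, new Pending(server, now + timeoutMillis));
    }

    // Called for every MSG_PROCESS_OPEN in a heartbeat: extend the timeout each time,
    // not just once, but only for the server that currently holds the assignment.
    void onProcessOpen(String region, String server, long now) {
        Pending p = pending.get(region);
        if (p != null && p.server.equals(server)) {
            p.deadline = now + timeoutMillis;
        }
    }

    // Called on MSG_REGION_OPEN: the open finished, so stop tracking the region.
    void onRegionOpen(String region, String server) {
        Pending p = pending.get(region);
        if (p != null && p.server.equals(server)) {
            pending.remove(region);
        }
    }

    // True only if the assigned server has stopped reporting for a full timeout interval.
    boolean needsReassignment(String region, long now) {
        Pending p = pending.get(region);
        return p != null && now >= p.deadline;
    }
}
```

With a 30-second timeout, a region assigned at time 0 whose server reports MSG_PROCESS_OPEN at 25 seconds is not eligible for reassignment until 55 seconds, and every further report pushes that deadline out again.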

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

