hbase-issues mailing list archives

From "Himanshu Vashishtha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-6774) Immediate assignment of regions that don't have entries in HLog
Date Tue, 30 Apr 2013 14:10:16 GMT

    [ https://issues.apache.org/jira/browse/HBASE-6774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13645599#comment-13645599 ]

Himanshu Vashishtha commented on HBASE-6774:

Hey Enis,

Thanks for asking these questions.

There is a per-regionserver *max_completeSequenceId* field in the attached doc, which is updated
each time the master receives a heartbeat from that regionserver. When the master processes the
server shutdown event, it uses the regionserver's max_completeSequenceId to determine how
much of the WAL it has missed and must read before finalizing allWALEntriesFlushed.
The goal is to process all WALEdits which have walEdit#key#logSequenceId > max_completeSequenceId.
If that means reading the second-to-last WAL as well, it will process that too. The invariant is to
read the latest WAL files first, until we reach a WAL in which some WALEdits have
walEdit#key#logSequenceId < max_completeSequenceId. At that point we no longer need to read older WALs.
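
A minimal Java sketch of the newest-first selection described above. It is not taken from the attached doc or from any patch: WalFile and selectWals are hypothetical names, and a WAL file is modelled as nothing more than the list of log sequence ids it contains, in write order. It also assumes sequence ids only grow, so the first edit in a file carries that file's smallest id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class WalSelectionSketch {
    /** Hypothetical stand-in for one WAL file: a name plus the sequence ids of its edits, in write order. */
    public static final class WalFile {
        public final String name;
        public final long[] sequenceIds;

        public WalFile(String name, long[] sequenceIds) {
            this.name = name;
            this.sequenceIds = sequenceIds;
        }
    }

    /**
     * Walks the WAL files newest first and returns the names of the files that
     * have to be read. Stops after the first file whose head edit is older than
     * the last sequence id reported to the master.
     */
    public static List<String> selectWals(List<WalFile> newestFirst, long maxCompleteSequenceId) {
        List<String> toRead = new ArrayList<>();
        for (WalFile wal : newestFirst) {
            toRead.add(wal.name);
            // Assumption: ids grow monotonically, so the head edit has the smallest id in the file.
            boolean reachesKnownPoint = wal.sequenceIds.length > 0
                    && wal.sequenceIds[0] < maxCompleteSequenceId;
            if (reachesKnownPoint) {
                break; // every older file only holds edits at or below the reported id
            }
        }
        return toRead;
    }

    public static void main(String[] args) {
        List<WalFile> wals = Arrays.asList(
                new WalFile("f1", new long[] {151, 160, 175}),
                new WalFile("f2", new long[] {90, 100, 120, 150}),
                new WalFile("f3", new long[] {10, 50, 89}));
        // With max_completeSequenceId = 100, f1 and f2 are selected and f3 is skipped.
        System.out.println(selectWals(wals, 100L));
    }
}
```

Only the head of each file is inspected to decide whether to continue, so the existing head-to-tail reader can be reused and no backward reads are needed.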

bq. If a region has not got any update for some time, its latestCompleteFlushSeqId wont be
updated at all, since there will be no flushes. To reassign this region, we have to ensure
that all wals are read. 

The master uses max_completeSequenceId to read the remaining WAL. Once it has read all the WALEdits
after max_completeSequenceId, allWALEntriesFlushed will have the correct information, and
it can be used to decide whether a region can be assigned.

bq. The only reliable way is to read up the wal backwards, 
I am not sure whether a SequenceFile can be read backwards, or how efficient that would be.
That's why I propose to read a WAL file from its head and re-use the existing WALReader code.

As soon as any region is flushed, the master will have up-to-date information for all regions
on that regionserver once it receives the next heartbeat.

Consider a rogue scenario: a regionserver sends a report with max_completeSequenceId =
100. Under a write-heavy workload the WAL is rolled, and then the server aborts. The master misses
all later heartbeats before the rs aborts. Based on max_completeSequenceId, we need to read the
last 2 WAL files (1 + 1): the new one, and the one that was current when the master got the last
heartbeat (it has some entries > 100). Since we read the most recent files first, it is easy to
determine whether we need to read older WALs or not. Let's call those files f1 and f2, where f1 is the latest.

The master reads f1 first and sees that its first walEdit#key#logSequenceId > 100, so it enqueues
f2 as well, since there may be entries at f2's tail that were missed.
Once it has read f1 and f2 and updated allWALEntriesFlushed for the regions, the master can
decide which regions can be assigned right away.
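
The final decision step could be sketched in Java roughly as follows. This is one reading of the approach, not the code in the attached doc: it assumes allWALEntriesFlushed is a per-region boolean that the master last refreshed at the heartbeat carrying max_completeSequenceId, and that any edit seen after that point marks its region as needing log replay. Edit and assignableRegions are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class ImmediateAssignmentSketch {
    /** Hypothetical stand-in for a WALEdit: the region it belongs to and its log sequence id. */
    public static final class Edit {
        public final String region;
        public final long sequenceId;

        public Edit(String region, long sequenceId) {
            this.region = region;
            this.sequenceId = sequenceId;
        }
    }

    /**
     * Starts from the per-region flags known at the last heartbeat, clears the flag
     * of every region that has an edit newer than maxCompleteSequenceId, and returns
     * the regions whose flag is still set.
     */
    public static Set<String> assignableRegions(Map<String, Boolean> flushedAtLastHeartbeat,
                                                List<Edit> editsFromTailWals,
                                                long maxCompleteSequenceId) {
        Map<String, Boolean> allWalEntriesFlushed = new HashMap<>(flushedAtLastHeartbeat);
        for (Edit edit : editsFromTailWals) {
            if (edit.sequenceId > maxCompleteSequenceId) {
                allWalEntriesFlushed.put(edit.region, false); // region has edits the master missed
            }
        }
        Set<String> assignable = new TreeSet<>();
        for (Map.Entry<String, Boolean> entry : allWalEntriesFlushed.entrySet()) {
            if (entry.getValue()) {
                assignable.add(entry.getKey());
            }
        }
        return assignable;
    }

    public static void main(String[] args) {
        Map<String, Boolean> atHeartbeat = new HashMap<>();
        atHeartbeat.put("r1", true);
        atHeartbeat.put("r2", true);
        atHeartbeat.put("r3", false);
        List<Edit> edits = Arrays.asList(new Edit("r1", 90), new Edit("r2", 120), new Edit("r3", 130));
        // Only r1 stays assignable: its single edit is at or below the reported id of 100.
        System.out.println(assignableRegions(atHeartbeat, edits, 100L));
    }
}
```

Regions that come back from assignableRegions would be assigned immediately under these assumptions, while the rest wait for log splitting as they do today.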

Hope this helps.
> Immediate assignment of regions that don't have entries in HLog
> ---------------------------------------------------------------
>                 Key: HBASE-6774
>                 URL: https://issues.apache.org/jira/browse/HBASE-6774
>             Project: HBase
>          Issue Type: Improvement
>          Components: master, regionserver
>    Affects Versions: 0.95.2
>            Reporter: Nicolas Liochon
>            Assignee: Himanshu Vashishtha
>         Attachments: HBase-6774-approach.pdf
> The algo is today, after a failure detection:
> - split the logs
> - when all the logs are split, assign the regions
> But some regions can have no entries at all in the HLog. There are many reasons for this:
> - kind of reference or historical tables. Bulk written sometimes then read only.
> - sequential rowkeys. In this case, most of the regions will be read only. But they can
be in a regionserver with a lot of writes.
> - tables flushed often for safety reasons. I'm thinking about meta here.
> For meta; we can imagine flushing very often. Hence, the recovery for meta, in many cases,
will be the failure detection time.
> There are different possible algos:
> Option 1)
>  A new task is added, in parallel with the split. This task reads all the HLogs. If there
is no entry for a region, this region is assigned.
>  Pro: simple
>  Cons: We will need to read all the files. Add a read.
> Option 2)
>  The master writes in ZK the number of log files, per region.
>  When the regionserver starts the split, it reads the full block (64M) and decreases the
log file counter of the region. If it reaches 0, the assignment starts. At the end of its split,
the region server decreases the counter as well. This allows the assignment to start even if not
all the HLogs are finished. It would make some regions available even if we have an
issue in one of the log files.
>  Pro: parallel
>  Cons: adds something to do for the region server. Requires reading the whole file before
starting to write. 
> Option 3)
>  Add some metadata at the end of the log file. The last log file won't have meta data,
as if we are recovering, it's because the server crashed. But the others will. And last log
file should be smaller (half a block on average).  
> Option 4) Still some metadata, but in a different file. Cons: write are increased (but
not that much, we just need to write the region once). Pros: if we lose the HLog files (major
failure, no replica available) we can still continue with the regions that were not written
at this stage.
> I think it should be done, even if none of the algorithms above is totally convincing
yet. It's linked as well to locality and short circuit reads: with these two points reading
the file twice becomes much less of an issue for example. My current preference would be to
open the file twice in the region server, once for splitting as of today, once for a quick
read looking for unused regions. Who knows, maybe it would even be faster this way, the quick
read thread would warm up the different caches for the splitting thread.

