hbase-issues mailing list archives

From "nkeywal (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-6752) On region server failure, serve writes and timeranged reads during the log split
Date Fri, 21 Sep 2012 19:25:08 GMT

    [ https://issues.apache.org/jira/browse/HBASE-6752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13460742#comment-13460742 ]

nkeywal commented on HBASE-6752:

Seems reasonable, though there are still some dark areas around timerange. Let's do things
smoothly :-). But I think your comment is right.

Various points I had in mind:
There is another use case mentioned in HBASE-3745: "In some applications, a common access
pattern is to frequently scan tables with a time range predicate restricted to a fairly recent
time window. For example, you may want to do an incremental aggregation or indexing step only
on rows that have changed in the last hour. We do this efficiently by tracking min and max
timestamp on an HFile level, so that old HFiles don't have to be read."
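
To make the quoted idea concrete, here is a minimal sketch of pruning files by their min/max
timestamps. FileTimeRange and TimeRangePruning are hypothetical names used for illustration,
not HBase classes, and the half-open query range [queryMin, queryMax) is an assumption of
the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the min/max timestamps kept per HFile; not an HBase class.
final class FileTimeRange {
    final String name;
    final long minTimestamp; // oldest cell timestamp in the file, inclusive
    final long maxTimestamp; // newest cell timestamp in the file, inclusive

    FileTimeRange(String name, long minTimestamp, long maxTimestamp) {
        this.name = name;
        this.minTimestamp = minTimestamp;
        this.maxTimestamp = maxTimestamp;
    }

    // True if the file may hold cells inside the query range [queryMin, queryMax).
    boolean overlaps(long queryMin, long queryMax) {
        return minTimestamp < queryMax && maxTimestamp >= queryMin;
    }
}

final class TimeRangePruning {
    // Keep only the files whose timestamp range intersects the query range.
    static List<FileTimeRange> filesToRead(List<FileTimeRange> files,
                                           long queryMin, long queryMax) {
        List<FileTimeRange> selected = new ArrayList<>();
        for (FileTimeRange file : files) {
            if (file.overlaps(queryMin, queryMax)) {
                selected.add(file);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<FileTimeRange> files = Arrays.asList(
            new FileTimeRange("old", 0L, 100L),
            new FileTimeRange("recent", 900L, 1000L));
        // Only files overlapping [950, 2000) need to be opened.
        for (FileTimeRange file : filesToRead(files, 950L, 2000L)) {
            System.out.println(file.name);
        }
    }
}
```

A query restricted to a recent window only touches the files whose range overlaps it, so
older files are never opened.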

bq. We do want the old edits to come in the correct order of sequence ids 
Imho yes, we should not relax any point of the HBase consistency.

bq. So, we somehow need to cheaply find the correct sequence id to use for the new puts. It
needs to be bigger than sequence ids for all the edits for that region in the log files. So
maybe all that's needed here is to open recover the latest log file, and scan it to find the
last sequence id?
I would like HBase to be resilient to log file issues (no replica, corrupted files, overloaded
datanodes, bad luck when choosing the datanode to read from...) by not opening them at all
during this process. Would a rough estimate be ok? For example, counting the number of
files/blocks to calculate an upper bound on the number of ids?
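
A possible shape for such an estimate, as a sketch only: SequenceIdEstimate and all of its
parameters are hypothetical, and it assumes we know a sequence id that is already persisted,
the number of log files, a maximum log file size and a minimum size per edit. The bound is
deliberately pessimistic, since nothing is read from the logs.

```java
// Hypothetical helper, not an HBase class: bounds the sequence ids that the
// unsplit log files can contain without opening any of them.
final class SequenceIdEstimate {
    static long upperBound(long lastPersistedSeqId, int logFileCount,
                           long maxLogFileBytes, long minBytesPerEdit) {
        if (logFileCount < 0 || maxLogFileBytes < 0 || minBytesPerEdit <= 0) {
            throw new IllegalArgumentException("invalid sizing parameters");
        }
        // Assumption: no edit is smaller than minBytesPerEdit, so a file cannot
        // hold more than this many edits (+1 to round up).
        long maxEditsPerFile = maxLogFileBytes / minBytesPerEdit + 1;
        // multiplyExact/addExact throw ArithmeticException instead of silently overflowing.
        long maxEdits = Math.multiplyExact((long) logFileCount, maxEditsPerFile);
        return Math.addExact(lastPersistedSeqId, maxEdits);
    }

    public static void main(String[] args) {
        // 3 log files of at most 1024 bytes, edits of at least 16 bytes.
        System.out.println(upperBound(1000L, 3, 1024L, 16L));
    }
}
```

New puts would then start strictly above the returned value. Whether such a gap in the
sequence ids is acceptable is part of the open question.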

bq. Picking a winner among duplicates in two files relies on using sequence id of the HFile
as a tie-break. And therefore, today, compactions always pick a dense subrange of files order
by sequence ids. 
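
As an illustration of the tie-break described in this quote, a simplified sketch: CellVersion
and DuplicateResolver are hypothetical names, and all candidates are assumed to share the same
row and column.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical, simplified view of one version of a cell read from one file.
final class CellVersion {
    final long timestamp;      // cell timestamp
    final long fileSequenceId; // sequence id of the file the cell was read from
    final String value;

    CellVersion(long timestamp, long fileSequenceId, String value) {
        this.timestamp = timestamp;
        this.fileSequenceId = fileSequenceId;
        this.value = value;
    }
}

final class DuplicateResolver {
    // Newest timestamp first; for equal timestamps, the higher file sequence id first.
    static final Comparator<CellVersion> NEWEST_FIRST = (a, b) -> {
        int byTimestamp = Long.compare(b.timestamp, a.timestamp);
        return byTimestamp != 0
            ? byTimestamp
            : Long.compare(b.fileSequenceId, a.fileSequenceId);
    };

    // Returns the version that wins among candidates for the same cell coordinates.
    static CellVersion pickWinner(List<CellVersion> candidates) {
        return Collections.min(candidates, NEWEST_FIRST);
    }

    public static void main(String[] args) {
        List<CellVersion> duplicates = Arrays.asList(
            new CellVersion(5L, 10L, "from older file"),
            new CellVersion(5L, 20L, "from newer file"));
        System.out.println(pickWinner(duplicates).value);
    }
}
```

The tie-break only gives the right answer if the file sequence ids reflect the order in which
the edits arrived, which is why the ids given to new puts during the split matter.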

I wonder if we need major compactions? I was thinking that they could be skipped. But we need
to be able to manage minor compactions for sure. I imagine that we can have some critical
cases where we stay in the intermediate state for a few days: (weekend + trying to fix the
broken hlog on a test cluster + waiting for a non critical moment for fixing the production

> On region server failure, serve writes and timeranged reads during the log split
> --------------------------------------------------------------------------------
>                 Key: HBASE-6752
>                 URL: https://issues.apache.org/jira/browse/HBASE-6752
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: Gregory Chanan
>            Priority: Minor
> Opening for write on failure would mean:
> - Assign the region to a new regionserver. It marks the region as recovering
>   -- specific exception returned to the client when we cannot serve.
>   -- allow them to know where they stand. The exception can include some time information
>      (failure started on: ...)
>   -- allow them to go immediately to the right regionserver, instead of retrying or calling
>      the region holding meta to get the new address
>      => save network calls, lower the load on meta.
> - Do the split as today. Priority is given to the region servers holding the new regions
>   -- helps to share the load balancing code: the split is done by region servers considered
>      as available for new regions
>   -- helps locality (the recovered edits are available on the region server) => lower
>      network usage
> - When the split is finished, we're done as of today
> - While the split is progressing, the region server can
>  -- serve writes
>    --- that's useful for all applications that need to write but not read immediately:
>    --- whatever logs events to analyze them later
>    --- opentsdb is a perfect example.
>  -- serve reads if they have a compatible time range. For heavily used tables, it could
>     be a help, because:
>    --- we can expect to have a few minutes of data only (as it's loaded)
>    --- the heaviest queries often accept a delay of a few minutes or more.
> Some "What if":
> 1) the split fails
> => Retry until it works. As today. Just that we serve writes. We need to know (as
>    today) that the region has not recovered if we fail again.
> 2) the regionserver fails during the split
> => As 1 and as of today.
> 3) the regionserver fails after the split but before the state change to fully available.
> => New assign. More logs to split (the ones already done and the new ones).
> 4) the assignment fails
> => Retry until it works. As today.
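
The behaviour described above for a recovering region could be sketched as follows.
RegionRecoveringException and RecoveringRegionGate are hypothetical names, not existing HBase
classes. The sketch also assumes one possible reading of "compatible time range": a read is
served only if its range starts at or after the moment the region was reopened, which is only
safe if cell timestamps are assigned by the server.

```java
// Hypothetical exception telling the client where it stands; not an HBase class.
final class RegionRecoveringException extends RuntimeException {
    final long failureStartedAtMillis;

    RegionRecoveringException(long failureStartedAtMillis) {
        super("region is recovering, failure started on: " + failureStartedAtMillis);
        this.failureStartedAtMillis = failureStartedAtMillis;
    }
}

// Hypothetical gate deciding what a recovering region may serve.
final class RecoveringRegionGate {
    private final long recoveryStartMillis;
    private volatile boolean recovering = true;

    RecoveringRegionGate(long recoveryStartMillis) {
        this.recoveryStartMillis = recoveryStartMillis;
    }

    // Called once the log split is finished and the edits are replayed.
    void markRecovered() {
        recovering = false;
    }

    // Writes are accepted in both states.
    void checkWrite() {
    }

    // Assumption: while recovering, only reads whose range starts at or after
    // the recovery start are served; older data may still sit in unsplit logs.
    void checkRead(long minTimestamp, long maxTimestamp) {
        if (recovering && minTimestamp < recoveryStartMillis) {
            throw new RegionRecoveringException(recoveryStartMillis);
        }
    }
}
```

A client catching the exception can use the embedded time information to decide whether to
wait or to narrow its time range.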

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
