hbase-issues mailing list archives

From "Jesse Yates (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-6180) [brainstorm] Timestamp based snapshots in HBase 0.96
Date Wed, 06 Jun 2012 23:01:23 GMT

    [ https://issues.apache.org/jira/browse/HBASE-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290554#comment-13290554 ]

Jesse Yates commented on HBASE-6180:

Here is what I've been thinking about for doing timestamp based snapshotting, as an extension
to the work I've been doing for HBASE-6055.

Timestamp based snapshots are a zero-downtime, non-blocking version of taking a snapshot across
a table in HBase. They should be considered 'fuzzy' because you don't get a global view, but
only as close to globally consistent as we can get with timestamps on the region servers (the
fuzziness is the NTP difference between RSs, which defaults to a max skew of 60 sec). I'm going
to mingle a bit of theory and implementation here, but feel free to ask questions about things
that don't make sense.

All the infrastructure from point-in-time snapshots (HBASE-50) is still going to be used here:
SnapshotManager on the Master, the RegionSnapshotHandler on the RS, etc. The only change is
what actually happens on each of the regions when taking the snapshot and how the snapshot
is managed on the Master. Also, on a lower level, the time constraints are much looser on
taking the snapshot.

Let's walk through some of the changes to the actual implementation.

From a high level, we still tell all the RS to start the snapshot. They will then dump
a meta edit into the WAL with the memstore timestamp (not clear if this is necessary, but
could be useful for completing snapshots on failed RS). They will then post back to ZK that
they are starting the snapshot. Each RS can then go about their business adding references
to all the files on the FS for the Regions involved in the snapshot. It gets a little tricky
when we try to capture the in-memory state of each RS.

The key here though is that we can use the Memstore's built-in snapshot functionality to avoid
doing any work with the WALs and just keep track of HFiles. When flushing, the Memstore takes
a "snapshot" by just blocking for a moment to switch two pointers between the current and
the new memstore. All writes before the switch go into the old memstore. All new writes go
into the new memstore. The old memstore can then be flushed to disk asynchronously, and on
scan we just merge in the results from the old version as well as the on-disk files. The benefit
of this is that we basically take no downtime to flush the memstore (except for corner cases
where there are too many HFiles on disk already, but we can ignore that as it is part of the
overall HBase design).
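
As a rough illustration of the pointer switch described above, here is a minimal sketch. SwappingMemstore and all of its methods are hypothetical names made up for this example, not the real HBase MemStore API, and a plain sorted map stands in for the KV set.

```java
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical, simplified stand-in for a memstore; not the HBase MemStore class. */
class SwappingMemstore {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private NavigableMap<String, String> active = new ConcurrentSkipListMap<>();
    private NavigableMap<String, String> snapshot = new ConcurrentSkipListMap<>();

    /** Writers share the read lock, so they only wait while a swap is in progress. */
    void put(String key, String value) {
        lock.readLock().lock();
        try {
            active.put(key, value);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Blocks writers just long enough to switch the two references. */
    NavigableMap<String, String> snapshotForFlush() {
        lock.writeLock().lock();
        try {
            snapshot = active;
            active = new ConcurrentSkipListMap<>();
            return snapshot;
        } finally {
            lock.writeLock().unlock();
        }
    }

    /** Reads check the active set first, then the snapshot that is still being flushed. */
    String get(String key) {
        lock.readLock().lock();
        try {
            String value = active.get(key);
            return value != null ? value : snapshot.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Called once the flushed data is on disk; the in-memory snapshot is dropped. */
    void flushCompleted() {
        lock.writeLock().lock();
        try {
            snapshot = new ConcurrentSkipListMap<>();
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

In this sketch a read after flushCompleted() no longer sees the flushed rows, because the on-disk files that would serve them are not modeled.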

We can leverage the same mechanism but instead just make the swap time-based.

When the RS gets the update to take the snapshot, it also has a timespan through which it
should split writes between the memstores. For example, say we get a snapshot start notice
at 10:15:00 and a prepare phase length of 80 seconds (the max skew in the cluster +20 seconds
for safety - just an example). 
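
The arithmetic for that example window can be written down as a small sketch. SnapshotWindow is a hypothetical helper invented here, and treating the end of the window as exclusive is an assumption, not something the proposal specifies.

```java
import java.time.Duration;
import java.time.LocalTime;

/** Hypothetical helper: the interval during which writes are split between memstores. */
class SnapshotWindow {
    final LocalTime start;
    final LocalTime end;

    SnapshotWindow(LocalTime start, Duration maxSkew, Duration safetyMargin) {
        this.start = start;
        // Prepare phase length = max clock skew in the cluster + a safety margin.
        this.end = start.plus(maxSkew).plus(safetyMargin);
    }

    /** Start is inclusive, end is exclusive (an assumption made for this sketch). */
    boolean contains(LocalTime t) {
        return !t.isBefore(start) && t.isBefore(end);
    }
}
```

With a start of 10:15:00, a max skew of 60 seconds and a margin of 20 seconds, the window closes at 10:16:20. The sketch uses LocalTime for readability and does not handle windows that cross midnight.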

For those 80 seconds, each HRegion will then time-flush the memstore. We take a regular memstore
snapshot. Just like a regular flush, this ensures that all the outstanding writes to the memstore
get committed (waiting on the read point to roll forward). However, instead of immediately
writing to the new store, we split writes based on timestamp between the old and the new memstore.
This management is handled by the Store, which just does some simple checking on the edits
coming through to see which memstore it should direct the writes (admittedly, hand-waving
away some of the complexity here).
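
To make the routing concrete, here is a minimal sketch of the check the Store might do per edit. TimeSplitStore and Edit are hypothetical names, lists stand in for the memstores, and the choice that an edit stamped exactly at the snapshot timestamp goes to the old memstore is an assumption. MVCC, rollbacks and the rest of the hand-waved complexity are left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical edit: a row, a cell timestamp and a value. */
final class Edit {
    final String row;
    final long timestamp;
    final String value;

    Edit(String row, long timestamp, String value) {
        this.row = row;
        this.timestamp = timestamp;
        this.value = value;
    }
}

/** Hypothetical store that splits writes by timestamp while the snapshot window is open. */
class TimeSplitStore {
    private final long snapshotTimestamp;
    private final List<Edit> oldMemstore = new ArrayList<>();
    private final List<Edit> newMemstore = new ArrayList<>();
    private boolean windowOpen = true;

    TimeSplitStore(long snapshotTimestamp) {
        this.snapshotTimestamp = snapshotTimestamp;
    }

    /** During the window, edits stamped at or before the snapshot time go to the old memstore. */
    synchronized void add(Edit edit) {
        if (windowOpen && edit.timestamp <= snapshotTimestamp) {
            oldMemstore.add(edit);
        } else {
            newMemstore.add(edit);
        }
    }

    /** Ends the window and hands back the old memstore contents to be flushed. */
    synchronized List<Edit> closeWindow() {
        windowOpen = false;
        return new ArrayList<>(oldMemstore);
    }

    synchronized List<Edit> newMemstoreContents() {
        return new ArrayList<>(newMemstore);
    }
}
```

Once closeWindow() has been called, every edit goes to the new memstore regardless of its timestamp, which mirrors directing all writes to the new store after the prepare phase.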

Conceptually, this is like taking a snapshot, but instead of just having the snapshot be the
immutable state (less the rollbacks made), we can just pass that KV set into a new MemStore
that acts just like a regular memstore. Since all the high-level edits still go through the
mvcc, we can keep track of the ordering of writes, and the rollback mechanism on the original
MemStore keeps both its own state and the state of the snapshot-based MemStore correct.

At this point, we can update the master (via ZK) that we have joined the snapshot. This is
not strictly necessary, but is nice since we can then track progress of a snapshot. For instance,
if an RS hasn't responded within a certain window, we can immediately fail the snapshot
and assume the RS has become inoperable. Since we are using the internal flushing mechanisms
to remain mostly non-blocking, we can actually skip doing this update and just notify the
master when we have done the write.

An alternative implementation here is to do what Jon has suggested and do a set of meta writes
for the beginning and end of a snapshot. Then all you have to do is keep track of the WALs
for the snapshot and replay those at the right time. However, that adds some complexity into
how to restore a snapshot and may require rolling the WAL after the snapshot has been taken
- a worrisome amount of complexity for something that should be entirely immutable. Since
the flush can be done async and we don't block writes while waiting, it doesn't seem like
a major issue to wait a little longer to complete a snapshot.

Back to the dual-pointer memstore snapshot implementation, once we pass the 80 seconds, we
then flush the old store to disk, add a reference to the new HFile, and then just direct all
writes to the new store. Conceptually, this all seems to hang together, but the implementation
is probably going to take a little more work.

There is a slight overhead to writes during the snapshot window. We will need to check the
timestamp of every write going into the memstore, to figure out the store it needs to be written
into. However, that is just a simple timestamp comparison and shouldn't be overly burdensome
to the write throughput (especially if you can take a snapshot during a low-write period).

After this snapshot window, the state of the memstore will have been snapshotted and a flush
will have been started. Now we can just flush this old memstore to disk as another HFile and
add a reference to it for the snapshot. It's completely fine if this process takes a while
because the server proceeds happily, taking reads and writes like nothing is amiss, since
the semantics are the same as a regular flush. Once the file hits disk (and we have added
references for each of the other files) we can consider the snapshot completed on that HRegion.
Once that process completes for all involved HRegions on the HRegionServer, we can consider
the snapshot completed on that HRegionServer.

Note that since the in memory state is all written to disk, we don't actually need to keep
track of any of the HLogs. There is probably some re-jiggering here around failed Puts and
the optimized write path there, but that is an implementation detail.

Once all the HRegionServers have taken the snapshot (passing up the notification by joining
the barrier), the Master considers the snapshot completed and can move the snapshot from the
.tmp to the .snapshot directory. The complete barrier is then just a barrier for the master,
rather than for the region servers, since there is no coordination necessary except to determine
if a snapshot failed because a RS couldn't complete (which only the master needs to keep track
of, to determine if a snapshot is valid or not).
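
A minimal sketch of that master-side barrier follows. SnapshotBarrier and its methods are hypothetical names; the real coordination would go through ZK and the directories would live on HDFS, so the in-memory set and the local java.nio.file move are stand-ins chosen only to keep the example runnable.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collection;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical master-side barrier: tracks which region servers still owe a result. */
class SnapshotBarrier {
    private final Set<String> pending = ConcurrentHashMap.newKeySet();
    private volatile boolean failed = false;

    SnapshotBarrier(Collection<String> regionServers) {
        pending.addAll(regionServers);
    }

    /** A region server reports that all of its regions finished the snapshot. */
    void joined(String regionServer) {
        pending.remove(regionServer);
    }

    /** A region server could not complete, which invalidates the whole snapshot. */
    void fail(String regionServer) {
        failed = true;
    }

    boolean isComplete() {
        return !failed && pending.isEmpty();
    }

    /** Moves the working directory to its final location once every server has joined. */
    boolean finalizeIfComplete(Path tmpDir, Path finalDir) throws IOException {
        if (!isComplete()) {
            return false;
        }
        Files.createDirectories(finalDir.getParent());
        Files.move(tmpDir, finalDir);
        return true;
    }
}
```

Only the master consults the barrier here; the region servers never wait on each other, which matches the point that no coordination is needed beyond detecting a failed RS.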

There are some gotchas with snapshotting with timestamps.

Suppose you are putting writes into the future. On a regular table doing a timestamp based
Scan will still not find those future writes; the same will be true of the snapshotted table
- those writes will be directed to the new store and not found in the snapshot. 

The only weirdness that occurs with this form of snapshots is with future/past writes - essentially
any time you start messing with the timestamps. Let's look at an example. At 10:15:00 you take
a snapshot of a table. However, on the same table, you make a Put - 'row', 'cf', 10:20:00,
'value' - at 10:10:00, a put in the future but made _before_ you take a snapshot. The snapshot
then proceeds as expected. At some point later, you revive the snapshot and do a scan of the
table with a timestamp of 10:15:00; you won't find that earlier put ('row', 'cf', 10:20:00,
'value'). However, if you just do a scan for the latest version, you *will* find that put!

It gets even odder if, instead of making that future put before the snapshot was taken, you
make it _while_ the snapshot is being taken. In this case, the revived snapshot will
give you different semantics. The scan of the snapshot at 10:15:00 will still give you the
same answer as before, but the latest version scan _will not find_ the future Put ('row',
'cf', 10:20:00, 'value').
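
The two cases can be checked with a small model. FuturePutDemo is a hypothetical class written for this example; times are seconds since midnight, a single clock is assumed (no skew), the 80 second window from the earlier example is reused, and a timestamp scan is simplified to "cell timestamp at or before the scan time".

```java
/** Hypothetical model of which puts end up in a timestamp based snapshot. */
class FuturePutDemo {
    static final long SNAPSHOT_TS = 10 * 3600 + 15 * 60; // 10:15:00
    static final long WINDOW_END = SNAPSHOT_TS + 80;     // 10:16:20

    /**
     * arrival is the wall-clock time the put reached the region server,
     * cellTs is the timestamp the client put on the cell.
     */
    static boolean capturedBySnapshot(long arrival, long cellTs) {
        if (arrival < SNAPSHOT_TS) {
            return true; // already in the old memstore or in HFiles
        }
        if (arrival < WINDOW_END) {
            return cellTs <= SNAPSHOT_TS; // routed by timestamp during the window
        }
        return false; // after the window everything goes to the new memstore
    }

    /** Simplified timestamp scan: a cell is visible if it is not newer than asOf. */
    static boolean visibleAsOf(long cellTs, long asOf) {
        return cellTs <= asOf;
    }
}
```

For the put stamped 10:20:00 that arrives at 10:10:00, capturedBySnapshot is true but visibleAsOf 10:15:00 is false, so only a latest-version scan of the revived snapshot finds it. If the same put arrives at 10:15:30 instead, capturedBySnapshot is false and neither scan finds it.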

Unfortunately, these are the semantics of using timestamps over global consistency. I (and
many others) feel that if you are messing with timestamps then it's buyer beware.

That said, there is a way to get global consistency if you do mess with timestamps. If you have
some centralized timestamp oracle, then this can give out strictly increasing timestamps with
a lease for the timestamps.  (I've got a long flight next week where I'm hoping to pump out
a basic implementation of this for hbase - no ticket, but just a little something on github).
Since you know that the timestamps will expire after a given period, you just set the expiration
time + fudge as the timespan to split the memstore writes. After the expiration period you
know that a timestamp is the oldest timestamp, so you can then comfortably flush the old memstore
to disk, knowing that you have all the edits from that timestamp back in time. Note that you
don't have the same problem as above since you only do scans in terms of the timestamps from
the oracle, so future and past are really globally relative - there are no real puts too far
into the future or past that are visible because all scans need to be as of a timestamp.
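
One way such an oracle could look, purely as a sketch: LeasedTimestampOracle, Lease and the millisecond units are all assumptions made for this example and do not describe an existing HBase component or the implementation mentioned above.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical centralized oracle handing out strictly increasing, leased timestamps. */
class LeasedTimestampOracle {
    /** A timestamp plus the wall-clock time after which it may no longer be used. */
    static final class Lease {
        final long timestamp;
        final long expiresAtMillis;

        Lease(long timestamp, long expiresAtMillis) {
            this.timestamp = timestamp;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final AtomicLong last = new AtomicLong();
    private final long leaseMillis;

    LeasedTimestampOracle(long leaseMillis) {
        this.leaseMillis = leaseMillis;
    }

    /** Never returns the same or a smaller timestamp, even if the clock stalls or steps back. */
    Lease next(long nowMillis) {
        long ts = last.updateAndGet(prev -> Math.max(prev + 1, nowMillis));
        return new Lease(ts, nowMillis + leaseMillis);
    }

    /** Timespan over which memstore writes are split: lease expiration time plus a fudge factor. */
    long splitWindowMillis(long fudgeMillis) {
        return leaseMillis + fudgeMillis;
    }
}
```

After splitWindowMillis has elapsed since the snapshot timestamp was issued, no outstanding lease can still carry an older timestamp, which is what makes flushing the old memstore safe in this scheme.
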
> [brainstorm] Timestamp based snapshots in HBase 0.96
> ----------------------------------------------------
>                 Key: HBASE-6180
>                 URL: https://issues.apache.org/jira/browse/HBASE-6180
>             Project: HBase
>          Issue Type: Brainstorming
>            Reporter: Jesse Yates
>             Fix For: 0.96.0
> Discussion ticket around doing timestamp based snapshots in HBase as an extension/follow-on
work for HBASE-6055. The implementation in HBASE-6055 (as originally defined) is not sufficient
for real-time clusters because it requires downtime to take the snapshot. 
> Time-stamp based snapshots should not require downtime at the cost of achieving global

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

