hbase-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "ryan rawson (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HBASE-2856) TestAcidGuarantee broken on trunk
Date Mon, 16 Aug 2010 22:02:16 GMT

    [ https://issues.apache.org/jira/browse/HBASE-2856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899119#action_12899119

ryan rawson commented on HBASE-2856:

This issue is pernicious and difficult.  I have attempted many times to nail this but there
continue to be issues.

Right now there are at least 2 issues:
- When a memstore is flushed to HFile the 'memstoreTS' generation stamp is dropped.  With
the way the scanners work right now, this causes some minor issue.  A row is really composed
on a sequence of KeyValues, with different memstoreTS values, so you can imagine right before
the flush the scanner looks sort of like so:

                           scanner points here
(new gen here) |
K K K K K K K K K K K K K K K K K K K K K  (keys for 1 row)
                                    (new and old gen)

The new gen have newer memstoreTS value and were ignored during the next().  So the scanner
is pointing to a Key part of the way through a row.   When the memstore is pushed to a HFile,
we mistakenly ignore all the skipped new gen keys, and we end up with the FIRST KeyValue from
the older gen, and the subsequent columns from the newer gen, ending up with a mixed generation
result, thus looking like non-atomic put.

One solution to this problem is to take the top Key during a scanner reset (done after a switch
between memstore -> hfile) and transform it into KeyValue.createFirstOnRow(). There has
to be exception code for partial row fetches where the scanner gets partial row results each
time (Because the row is too big).

- The second issue is that of scanner updates and multi-column families.  Right now we do
code sort of like this:
for each store (aka: family) in region:
   switch memstore -> hfile

For each Store/family the switch between memstore->hfile is done atomically with respect
to scanners, that is for a scanner they either access the old snapshot and old files or the
new null snapshot and the new files.  But this atomic snapshot is done at a Store at-a-time
level, meaning that concurrent scanners might see some data in one store from memstore (complete
with generation timestamps) and some data in another store all from hfile (with no timestamps),
thus again ending up with mixed generation row data that violates row atomicity.

One possible solution to this problem is to move scanner updates to the HRegion/RegionScanner
level.  This is not as easy as it sounds from a code structure point of view.  Furthermore
by using larger update blocks we may be introducing performance issues surrounding scanner


The ultimate problem here is these two problems are just the most recently identified bugs
in a long list of bugs.  We can fix them in the ways described above, but what about the next
series of bugs?  We are plugging leaks in a dam.  A better approach would be to build a new,
better, dam, only with the knowledge we have about what previously went wrong.


The proposal is thusly:
- Add a new generation timestamp (or perhaps we could call it transaction id) to every KeyValue.
 This generation timestamp is also serialized and stored inside HFile.  
- The generation timestamp starts at 1 for brand new tables.  Every write operation increments
this by 1, much like ReadWriteConsistencyControl
- KeyValues with ts=0 belong to no generation and are always included in every read operation
- KeyValues with ts>0 belong to that specific generation and the reader should include
them only if their own read point is larger. The reader obtains the read point at scanner
start time, and may possibly update it between rows if the implementation can do so. (As per
the current RWCC algorithms)
- Bulk imported hfiles have all their TS=0 (the default value)
- Modifications set their ts=X as per the current RWCC algorithms.
- During hfile flush the largest TS is written to the HFile metadata.
- During HRegion/Store init, the largest TS is obtained from HFile metadata, and the RWCC
is set to this value.

Some predicted questions:
- What about the extra space? 8 bytes is a lot.  
-- With vint encoding and that the values start at 1, we should be able to shave a lot of
space off.  We will also have to store a length field with 1 extra byte. 2 bytes minimum for
- What about vint decode speed?
-- It may not be an issue, we need to profile.  If it is, we can trade off RAM space in KeyValue
to cache this.
- What about file migration? I dont want to take a long time to upgrade my cluster!
-- By creating a new metadata key, call it KEYVALUE_VERSION, the Store layer on top of HFile
can parse the older KeyValues from HFile (which wont have this metadata), substituting '0'
for the missing timestamp. 
-- Newer HFiles will use the newer KeyValue parse code to obtain the generation timestamp.
- This is a lot of cost just to fix something that seems we should know in memory
-- Yes it isn't cheap, but this problem has proven to be difficult to solve. By addressing
the issue directly we can have confidence in our design and the existing test frameworks which
are failing should validate it for us.
- What about splits?
-- Each split region will 'take off' from the maximal generation timestamp and evolve independently.
 If they were to be merged together that'd be ok, each region would have to close, halting
all writes, then take the largest generation timestamp and continue evolving from there. 
The timestamp exists to differentiate committed and non-committed data, and during a close
by definition all values in the HFile would be committed.
- Ok so I'm paying at least 2 bytes per KeyValue, can we get better return on investment?
-- Yes!  By using the generation timestamp we can detect the situation where a previously
inserted Delete is masking new Puts.  We can switch between the existing behaviour and newer
behaviour at runtime and ignore the rogue Delete marker.
-- There may be other benefits as well. Providing a partial total order to all the KeyValue
inserts and grouped by transaction would provide useful debugging info, for both HBase and
client bugs.
- What about read isolation?
-- With this we can definitely implement at-scanner-open read isolation.  Due to the nature
of the scanner design advancing the generation timestamp read point is more difficult, but
it may be possible.

> TestAcidGuarantee broken on trunk 
> ----------------------------------
>                 Key: HBASE-2856
>                 URL: https://issues.apache.org/jira/browse/HBASE-2856
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.89.20100621
>            Reporter: ryan rawson
>            Assignee: ryan rawson
>            Priority: Blocker
>             Fix For: 0.90.0
> TestAcidGuarantee has a test whereby it attempts to read a number of columns from a row,
and every so often the first column of N is different, when it should be the same.  This is
a bug deep inside the scanner whereby the first peek() of a row is done at time T then the
rest of the read is done at T+1 after a flush, thus the memstoreTS data is lost, and previously
'uncommitted' data becomes committed and flushed to disk.
> One possible solution is to introduce the memstoreTS (or similarly equivalent value)
to the HFile thus allowing us to preserve read consistency past flushes.  Another solution
involves fixing the scanners so that peek() is not destructive (and thus might return different
things at different times alas).

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message