hbase-issues mailing list archives

From "Nick Dimiduk (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-8693) Implement extensible type API based on serialization primitives
Date Wed, 17 Jul 2013 23:06:49 GMT

    [ https://issues.apache.org/jira/browse/HBASE-8693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711775#comment-13711775 ]

Nick Dimiduk commented on HBASE-8693:
-------------------------------------

bq. Ok, make sense with this limited scope (no schema) have a fixed list of fields.

Right. In this implementation Struct is a simple concatenation of fields. No schema information
is written into that concatenation because doing so would interfere with sort order. Struct is
merely an API convenience. Now, the field encodings implemented in OrderedBytes include a header
byte, which is currently used to identify the type of the encoded field that follows. The current
implementation does not consume the full space of 256 bit patterns available in that header byte.
I've been thinking about extending that header byte to include some version bits at the very
beginning. That would enable evolution of the individual field encodings (say, if you later
want to re-implement blob-mid). This doesn't address the user-level logical structure
of a Struct data type, only evolution of the OrderedBytes codec.
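
To make the version-bits idea concrete, here is a minimal sketch of one possible header layout. It is not how OrderedBytes lays out its header byte today: the 2/6 bit split and the names pack_header and unpack_header are assumptions made up for illustration.

```python
# Hypothetical layout (an assumption, not the OrderedBytes format):
# the top 2 bits carry a codec version, the low 6 bits carry the type tag.
VERSION_BITS = 2
TYPE_BITS = 8 - VERSION_BITS
TYPE_MASK = (1 << TYPE_BITS) - 1


def pack_header(version, type_tag):
    """Combine a codec version and a type tag into a single header byte."""
    if not 0 <= version < (1 << VERSION_BITS):
        raise ValueError("version does not fit in the version bits")
    if not 0 <= type_tag <= TYPE_MASK:
        raise ValueError("type tag does not fit in the type bits")
    return (version << TYPE_BITS) | type_tag


def unpack_header(header):
    """Split a header byte back into (version, type_tag)."""
    return header >> TYPE_BITS, header & TYPE_MASK
```

One consequence of putting the version bits first in this sketch: any field written with a newer codec version compares greater than every field written with an older one, whatever its type, so mixing codec versions in one key space would affect ordering.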

bq. My main concern is: I start use 96 with this struct encoding... is fixed so I can't add
fields.. so I work around it adding a version number in front of the struct and then I do
the switch for v1, v2, v3 with all the fixed struct that I know...

Prepending a version number to the Struct's members will impact sort order. Struct definition
is fixed in that you can't prepend a new field or interpose one in the middle of an existing
encoded value. You're free to append fields. Appending a field would look like the following:

 # application defines Struct v0 with members [A,B,C]
 # application writes lots of data
 # application changes, Struct v1 becomes [A,B,C,D,E]
 # application writes lots more data

At step 3, the application now needs to become version-aware. Because the fields of v0 are
a leading subset of v1, the application can use the definition of struct v1 with the following
safeguards. (1) Any place where v0 was used now needs to check for end-of-buffer and skip
over the two new elements. (2) Any place where v1 is used needs to be mindful of truncated
records and be prepared to receive only the v0 fields. Maybe the API defined around Struct
can be improved to support these needs?
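
As a rough illustration of safeguard (2), the sketch below decodes with the v1 definition and stops at end-of-buffer, so a v0 record simply yields no values for D and E. It does not use the Struct or OrderedBytes APIs: the fixed 4-byte big-endian field encoding and the names encode and decode_v1 are stand-ins assumed for the example.

```python
import struct

# Stand-in field encoding (an assumption): 4-byte big-endian unsigned ints.
FIELD = struct.Struct(">I")
V1_FIELDS = ("A", "B", "C", "D", "E")


def encode(*values):
    """Concatenate the encoded fields, with no schema or version information."""
    return b"".join(FIELD.pack(v) for v in values)


def decode_v1(buf):
    """Decode using the v1 definition, tolerating truncated (v0) records.

    Fields that are absent from the buffer are returned as None.
    """
    record = dict.fromkeys(V1_FIELDS)
    offset = 0
    for name in V1_FIELDS:
        if offset >= len(buf):  # end-of-buffer: this is a shorter, older record
            break
        (record[name],) = FIELD.unpack_from(buf, offset)
        offset += FIELD.size
    return record
```

With this shape, decode_v1(encode(1, 2, 3)) returns A, B and C populated and D and E as None, and the caller decides how to treat the missing fields.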

Records of v0 and v1 can be intermixed, i.e., as rowkeys in the same table. According to the
documented sort semantics, they'll sort "left-to-right and depth-first". Meaning, they'll
sort first according to v0 values and then, within that group, by v1 values.
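
The ordering of intermixed records can be sketched with plain byte strings, which compare byte by byte as unsigned values with a shorter prefix sorting first. The fixed 4-byte big-endian field encoding below is an assumption chosen to keep the example small; it is not the OrderedBytes format.

```python
import struct


def encode(*values):
    """Stand-in encoding (an assumption): 4-byte big-endian unsigned fields."""
    return b"".join(struct.pack(">I", v) for v in values)


rows = [
    encode(2, 0, 0),        # v0 record [A,B,C]
    encode(1, 1, 1, 9, 9),  # v1 record [A,B,C,D,E]
    encode(1, 1, 1),        # v0 record
    encode(1, 1, 1, 0, 5),  # v1 record
    encode(1, 1, 0, 7, 7),  # v1 record
]

# Sorting groups rows by their [A,B,C] values first; within a group the
# truncated v0 row comes first, followed by the v1 rows ordered by [D,E].
ordered = sorted(rows)
```

Here the v0 row (1, 1, 1) lands directly ahead of the two v1 rows that share its first three fields, and the v0 row (2, 0, 0) sorts last.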

We leave all of this up to user applications today, so the burden of change management remains.
Changing a compound rowkey today requires rewriting data (or duplication into a new table).
A smarter struct encoding, one that's able to preserve the sorted semantics I've described
but that can also track more sophisticated schema changes, would be very useful indeed -- I
don't think it exists.

Prepending a version field to a Struct will change the sorting behavior; v0 will sort before
v1, etc. IMHO, this is a less flexible migration strategy than the append behavior described
above. It's also perfectly valid, and the user of the Struct API is free to do so in their
own application. In that case, the application is still version-aware. Instead of being cautious
about consuming potentially truncated records, it's executing a scan for each
version.
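
For comparison, a sketch of the version-prefix strategy: every v0 row sorts ahead of every v1 row, and reads become one range scan per version. The single leading version byte, the 4-byte field encoding, and the helper names versioned and scan_version are all assumptions for illustration; the scan is simulated over a sorted list rather than an HBase table.

```python
import struct
from bisect import bisect_left


def encode(*values):
    """Stand-in encoding (an assumption): 4-byte big-endian unsigned fields."""
    return b"".join(struct.pack(">I", v) for v in values)


def versioned(version, *values):
    """Prepend a one-byte version tag to the encoded fields."""
    return bytes([version]) + encode(*values)


table = sorted([
    versioned(1, 1, 1, 1, 0, 5),
    versioned(0, 2, 0, 0),
    versioned(0, 1, 1, 1),
    versioned(1, 0, 0, 0, 0, 0),
])


def scan_version(sorted_rows, version):
    """Return the contiguous run of rows carrying the given version tag.

    Assumes version < 255 so that version + 1 still fits in one byte.
    """
    start = bisect_left(sorted_rows, bytes([version]))
    stop = bisect_left(sorted_rows, bytes([version + 1]))
    return sorted_rows[start:stop]
```

Note that the v1 row with all-zero fields still sorts after both v0 rows, so a query over a range of [A,B,C] values has to be issued once per version and merged by the application.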

bq. as you said, data evolution is out of the scope. so if you consider this patch just as
a "smarter" alternative to the Bytes encoding.

HBASE-8201 is a smarter alternative to Bytes and this ticket adds some higher-level APIs for
manipulating those encodings. In short, yes, schema definition and evolution are out of scope.
                
> Implement extensible type API based on serialization primitives
> ---------------------------------------------------------------
>
>                 Key: HBASE-8693
>                 URL: https://issues.apache.org/jira/browse/HBASE-8693
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Client
>            Reporter: Nick Dimiduk
>            Assignee: Nick Dimiduk
>             Fix For: 0.95.2
>
>         Attachments: 0001-HBASE-8693-Extensible-data-types-API.patch, 0001-HBASE-8693-Extensible-data-types-API.patch,
0001-HBASE-8693-Extensible-data-types-API.patch, 0002-HBASE-8693-example-Use-DataType-API-to-build-regionN.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
