hbase-dev mailing list archives

From Jonathan Hsieh <...@cloudera.com>
Subject hbase type encodings discussion part 2.
Date Fri, 16 May 2014 16:41:10 GMT
Below is a summary of a follow-up conversation to the previous pow-wow
[1] at the post-hbasecon 2014 hackathon about a proposed interoperable
encoding scheme for storing typed data in hbase [2].  Raw notes are available
here [3].



Attendees via phone: Nick Dimiduk, Ryan Blue, Jon Hsieh, Michael Stack,
Enis Soztutar, and James Taylor.

The group decided to first define requirements for the encoding, and
recommends the following as requirements for the chosen value encoding.

1) must have a memcomparable rowkey.
2) must have a null value distinct from the empty string in the row key
3) must be able to add nullable fields to end of primary key
4) must satisfy one of:
  a) indexable fields must be nullable, or
  b) any type that doesn't support nullability must be translatable, without
data loss, to a type that does (e.g. a fixed width int translates to a
nullable numeric)
5) all char types will be stored as varlength binary (to support chars that
are >1 byte).
6) fixed length binary values (e.g. md5's) should be supported, but as a
special case.  caveat emptor -- if you lose your schema, it's your fault
 (you won't be able to decipher the values without the schema).
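
To make requirement 1 concrete, here is a minimal Python sketch of what "memcomparable" means for a fixed width int: the encoded bytes must sort, under plain unsigned byte comparison, in the same order as the original values. The function names are made up for illustration, and the sign-bit offset shown is just one well-known way to get this property, not necessarily the layout the group will pick.

```python
import struct

def encode_int64(value: int) -> bytes:
    """Encode a signed 64-bit int so that byte order matches numeric order."""
    # Adding 2**63 shifts the signed range onto 0..2**64-1 (equivalent to
    # flipping the sign bit), so negatives sort below non-negatives when the
    # result is written as an unsigned big-endian integer.
    return struct.pack(">Q", value + (1 << 63))

def decode_int64(data: bytes) -> int:
    """Invert encode_int64."""
    (raw,) = struct.unpack(">Q", data)
    return raw - (1 << 63)

values = [-5, 3, 0, -(2**63), 2**63 - 1, 42]
# Sort the encoded byte strings the way hbase would compare row keys.
encoded_sorted = sorted(encode_int64(v) for v in values)
assert [decode_int64(e) for e in encoded_sorted] == sorted(values)
```

A plain two's complement big-endian encoding would fail this check, since negative values would sort after positive ones.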

Discussion: varbinary in-key encoding options:
1) single \0 byte terminator with no \0 allowed (phoenix style)
2) run length encoded \0's with two byte terminator (proposed)
3) 8 bytes for every 7 bytes "varblob" encoding (ordered bytes style)
Recommendation: run length encoded \0's with two byte terminator. (handles
nulls, easily human readable, likely low overhead in the common case)
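
For comparison, option 1 (single \0 terminator, no \0 allowed in the value) is simple enough to sketch in a few lines of Python. The helper names are hypothetical and this is only an illustration of the idea as described above, not code from phoenix. The run length encoded option is not sketched because its exact byte layout is not spelled out in this summary.

```python
from typing import Tuple

TERMINATOR = b"\x00"

def encode_varbinary(value: bytes) -> bytes:
    """Append a single \\0 terminator; values containing \\0 are rejected."""
    if TERMINATOR in value:
        raise ValueError("embedded \\0 bytes are not representable in this scheme")
    return value + TERMINATOR

def decode_varbinary(buf: bytes, offset: int = 0) -> Tuple[bytes, int]:
    """Return (value, offset of the next field) for the field starting at offset."""
    end = buf.index(TERMINATOR, offset)
    return buf[offset:end], end + 1

# Two-field keys: the terminator keeps field boundaries from bleeding together.
key_a = encode_varbinary(b"a") + encode_varbinary(b"c")    # ("a", "c")
key_b = encode_varbinary(b"ab") + encode_varbinary(b"a")   # ("ab", "a")
assert (b"a", b"c") < (b"ab", b"a")   # logical (field by field) order
assert key_a < key_b                  # encoded byte order agrees
assert b"a" + b"c" > b"ab" + b"a"     # naive concatenation gets it wrong
```

Note that in this sketch an empty value encodes to a lone terminator, so there is no separate representation left over for null. That is the gap requirement 2 is meant to close.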

Discussion: multipart key encodings
1) tagged bytes - includes field position + type tag
2) type ordinals - ordered bytes encodings.
Recommendation: use position + type tag approach.
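
As a rough illustration of the position + type tag idea, the Python sketch below prefixes each non-null field with its position and a type tag, and simply omits null fields. Everything about the byte layout here (one byte for position, one byte for the tag, the tag values, the int and varbinary payload formats) is an assumption made for the example; the actual layout was not settled in this discussion.

```python
import struct
from typing import List, Optional, Union

# Hypothetical tag values, chosen only for this example.
TAG_INT64 = 0x01
TAG_VARBINARY = 0x02

Field = Optional[Union[int, bytes]]

def encode_key(fields: List[Field]) -> bytes:
    """Encode each non-null field as [position][type tag][payload]."""
    out = bytearray()
    for position, value in enumerate(fields):
        if value is None:
            continue  # null fields are omitted entirely
        if isinstance(value, int):
            out += bytes([position, TAG_INT64])
            out += struct.pack(">Q", value + (1 << 63))  # sign-offset big-endian
        else:
            if b"\x00" in value:
                raise ValueError("embedded \\0 not supported in this sketch")
            out += bytes([position, TAG_VARBINARY]) + value + b"\x00"
    return bytes(out)

def decode_key(buf: bytes, width: int) -> List[Field]:
    """Decode into a list of `width` fields; missing positions stay None."""
    fields: List[Field] = [None] * width
    offset = 0
    while offset < len(buf):
        position, tag = buf[offset], buf[offset + 1]
        offset += 2
        if tag == TAG_INT64:
            (raw,) = struct.unpack_from(">Q", buf, offset)
            fields[position] = raw - (1 << 63)
            offset += 8
        elif tag == TAG_VARBINARY:
            end = buf.index(b"\x00", offset)
            fields[position] = buf[offset:end]
            offset = end + 1
        else:
            raise ValueError(f"unknown type tag {tag}")
    return fields

# Null is distinct from the empty string (requirement 2).
assert encode_key([None]) != encode_key([b""])
# Appending a nullable field leaves existing keys unchanged (requirement 3).
assert encode_key([b"row", 7]) == encode_key([b"row", 7, None])
assert decode_key(encode_key([b"row", None, 7]), 3) == [b"row", None, 7]
```

The sketch only checks round-tripping and the two null-related requirements; how keys with null fields should sort relative to keys without them is left open here.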

Discussion: data type api:
- we like the goal of the data type api (HBASE-8693), will try to use api
for proposed key encoding api.
- the arbitrary precision numeric type provides advantages.
- will try to implement encodings by modding or patterning off of the
OrderedBytes implementation.
- jon to try plumbing a type through the data types api and into phoenix so
that existing phoenix queries can be used to measure end-to-end perf impact.

Remaining topics and followup:
- how to handle "complex primitives" such as datetime, decimal and bigint.
- more discussion on list.
- plan to present updates and continue discussion at hadoop summit hbase
bof session thursday 6/5 in san jose [4]


// Jonathan Hsieh (shay)
// HBase Tech Lead, Software Engineer, Cloudera
// jon@cloudera.com // @jmhsieh
