hbase-dev mailing list archives

From Ryan Blue <rb...@cloudera.com>
Subject Re: [common type encoding breakout] Re: HBase Hackathon @ Salesforce 05/06/2014 notes
Date Tue, 13 May 2014 22:58:44 GMT
Here are a few more specific responses.

Hopefully this clears up some remaining points in the context of my last 
post.

> Why not use protobuf directly instead of reimplementing a slight
> variation of their format?

I intend to use protobuf directly for compound values. It isn't 
practical right now for keys because protobuf doesn't have value 
encodings that are memcmp, nor are its tags memcmp for fields > 15.
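
To illustrate the tag problem, here is a minimal sketch. It assumes the standard protobuf wire format, where a tag is the varint of (field_number << 3) | wire_type. The helper names encode_varint and tag_bytes are made up for this example and are not part of any protobuf library. Because varints put the low-order 7 bits first, tags stop sorting in field order once they need a second byte, which happens from field 16 onward.

```python
def encode_varint(n: int) -> bytes:
    """Base-128 varint: 7 bits per byte, least-significant group first."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def tag_bytes(field_number: int, wire_type: int = 0) -> bytes:
    """Field tag as it appears on the wire (hypothetical helper)."""
    return encode_varint((field_number << 3) | wire_type)


# Fields 1..15 fit in one byte, so their tags sort in field order.
one_byte_tags = [tag_bytes(f) for f in range(1, 16)]
assert one_byte_tags == sorted(one_byte_tags)

# Two-byte tags do not: field 17 compares greater than field 32.
print(tag_bytes(17).hex(), tag_bytes(32).hex())
assert tag_bytes(17) > tag_bytes(32)
```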

>     * memcmp encodings for primitives in cells desired for phoenix (2ndary
>     indices?)
>
> This sounds like a Phoenix-specific decision.

I think it's okay for the spec to optimize for certain patterns. Using 
the memcmp encodings in primitive cells allows us to do value comparison 
on encoded bytes and speed up scans. I was under the impression that 
this is something Phoenix does to speed up results, so we included it.
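
As a rough illustration of what a memcmp encoding buys you, here is a sketch for a signed 32-bit int. It is not the encoding from the spec or from OrderedBytes. It assumes the common trick of writing the value big-endian with the sign bit flipped, and encode_int32 / decode_int32 are hypothetical names. The point is that a filter can compare encoded bytes directly without decoding each cell.

```python
import struct


def encode_int32(value: int) -> bytes:
    """Big-endian with the sign bit flipped, so byte order matches numeric order."""
    return struct.pack(">I", (value & 0xFFFFFFFF) ^ 0x80000000)


def decode_int32(encoded: bytes) -> int:
    """Undo the sign-bit flip and reinterpret the 4 bytes as a signed int."""
    (raw,) = struct.unpack(">I", encoded)
    return struct.unpack(">i", struct.pack(">I", raw ^ 0x80000000))[0]


values = [-2**31, -300, -1, 0, 1, 255, 256, 2**31 - 1]
encoded = [encode_int32(v) for v in values]

# Sorting the encoded bytes gives the same order as sorting the numbers.
assert encoded == sorted(encoded)

# A "value >= 0" filter can run on the raw bytes.
threshold = encode_int32(0)
matches = [decode_int32(e) for e in encoded if e >= threshold]
```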

If we want to optimize for something else instead, what should we choose?

> OrderedBytes implements a bit-shifting strategy for this.
> {FixedLength,Terminated}Wrapper are provided to add flexibility. Ryan
> has suggested a variation of run-length encoding as another alternative,
> something we could add if there's sufficient need.

We went with the run-length encoding variant because in most cases, it 
decreases the size of the data or doesn't increase it too much. It 
increases the size only when there are single null bytes, in which case 
it adds a byte for each single null. Size is the same or reduced with 
two or more null bytes.
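
The size behavior can be sketched as follows. This message does not give the byte layout, so the layout here is an assumption made only for illustration: each run of null bytes becomes 0x00 followed by a one-byte run length. The function names are hypothetical, and the sketch does not attempt termination or order preservation, which a real key encoding would also need.

```python
def rle_encode_nulls(data: bytes) -> bytes:
    """Replace each run of null bytes with 0x00 plus a count byte (assumed layout)."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] != 0:
            out.append(data[i])
            i += 1
            continue
        run = 0
        # Count consecutive nulls, capping each chunk at 255 so it fits in one byte.
        while i < len(data) and data[i] == 0 and run < 255:
            run += 1
            i += 1
        out += bytes([0, run])
    return bytes(out)


def rle_decode_nulls(encoded: bytes) -> bytes:
    """Inverse of rle_encode_nulls."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == 0:
            out += b"\x00" * encoded[i + 1]
            i += 2
        else:
            out.append(encoded[i])
            i += 1
    return bytes(out)


# Single null grows by one byte, a pair is unchanged, longer runs shrink.
for raw in (b"a\x00b", b"a\x00\x00b", b"a\x00\x00\x00b"):
    print(len(raw), "->", len(rle_encode_nulls(raw)))
```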

We chose this over the OB (OrderedBytes) type for two reasons: it 
supports null bytes, and OB adds ceil(size / 7) + 1 bytes to each 
value and requires bit shifts to encode and decode.
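
To put numbers on the comparison, here is a small sketch that uses only the figures stated in this message: ceil(size / 7) + 1 added bytes for OB, and one added byte per isolated null for the run-length variant. Savings from longer null runs are ignored, so the run-length figure is a worst case. Both function names are hypothetical.

```python
import math
import re


def ob_added_bytes(size: int) -> int:
    """Bytes added per value, using the ceil(size / 7) + 1 figure quoted above."""
    return math.ceil(size / 7) + 1


def rle_added_bytes(value: bytes) -> int:
    """Worst-case growth: one extra byte for each isolated (single) null byte."""
    null_runs = re.findall(rb"\x00+", value)
    return sum(1 for run in null_runs if len(run) == 1)


value = b"user\x00name\x00\x00id"  # 13 bytes: one isolated null, one pair
print(ob_added_bytes(len(value)), rle_added_bytes(value))
```

The OB cost grows with the length of the value, while the run-length cost depends only on how many isolated nulls the value contains.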

>     * do we include 1 byte and 2 byte ints?
>
> Following the initial commit of HBASE-8201, these were requested in HBASE-9369.

+1 for small ints

> The above date question is a perfect example of why I think it's
> important that we have the DataType interface. Having the interface
> means an application can implement its own types when their needs are
> too unique for commit to HBase. Other applications can still use that
> implementation by including the relevant application jars. They enjoy
> interoperability by agreeing on the DataType implementation, not on
> something provided out of the box by a particular HBase version.

I think this spec would be a stronger interop guarantee. We should 
discuss whether we can support this spec along with existing data, 
although I suspect we probably can't.

rb

-- 
Ryan Blue
Software Engineer
Cloudera, Inc.
