hbase-dev mailing list archives

From Stack <st...@duboce.net>
Subject Re: FastDiffDeltaEncoder improvements
Date Fri, 17 Mar 2017 17:04:13 GMT
On Mon, Sep 12, 2016 at 12:46 AM, Дмитрий <dimka-747@i.ua> wrote:

> Hi all,
> I would like to discuss available implementations of data block encoding
> in HBase and how we can improve them.

(Sorry, missed the first posting)

Killer w/ encoders/decoders is allocation of a block to encode/decode into.
If encoding/decoding could be inlined, that'd help loads.

Other possible improvements would be keeping the encoding as you traverse
HBase, i.e. keeping content encoded while in cache, and being able to merge
sort encoded blocks or pick a Cell out of an encoded block w/o first having
to undo it all.

Have you done any profiling of codecs to see where we are slow?

> The most interesting for me is FastDiffDeltaEncoder because it encodes not
> only keys but also other fields
> like timestamp, type, keyLen, etc. It also removes duplicated values, and
> that is the most controversial feature
> for me. Look at the following image:
> [IMG]http://i68.tinypic.com/8z2wzn.png[/IMG]
The image does not work for me. Does it work for you?

> This is an example of a small table with row keys Row-1, Row-2, Row-3 and
> columns Column-A, Column-B, Column-C.
> DataBlockEncoder encodes cells ordered by key. Each key consists of
> RowKey, Family and Qualifier, so
> cells are encoded in the order shown by the blue line in the image.
> FastDiffDeltaEncoder calculates the difference between two consecutive
> cells, so duplicated values in Column-A
> will not be removed. The only case where it works is single-column
> tables.
> So, my suggestion is to detect duplicates within columns, not only in
> neighboring cells. I've also heard an idea
> not just to remove duplicated values, but to calculate the prefix
> difference between them, as is done for keys.
Have you had a look at hbase-prefix-tree? It is an old contribution that
unfortunately has seen little use. It slowed writing significantly but made
for nice improvements at read time. There is some overlap between what you
are thinking and the work done there.
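
To make the point about neighboring cells concrete, here is a minimal sketch using the Row-1..Row-3 / Column-A..Column-C table described above. It only compares each value with the value of the cell directly before it, which is the behaviour the quoted message attributes to FastDiffDeltaEncoder; it is not the actual HBase encoder. SimpleCell, sampleTable, singleColumnTable and the cell values are hypothetical names and data made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NeighborDuplicateSketch {

    // Hypothetical, simplified cell; this is NOT the HBase Cell interface.
    static final class SimpleCell {
        final String row;
        final String qualifier;
        final byte[] value;

        SimpleCell(String row, String qualifier, String value) {
            this.row = row;
            this.qualifier = qualifier;
            this.value = value.getBytes(StandardCharsets.UTF_8);
        }
    }

    // Counts cells whose value equals the value of the immediately preceding cell.
    static int countNeighborDuplicates(List<SimpleCell> cells) {
        int duplicates = 0;
        for (int i = 1; i < cells.size(); i++) {
            if (Arrays.equals(cells.get(i - 1).value, cells.get(i).value)) {
                duplicates++;
            }
        }
        return duplicates;
    }

    // Three rows by three columns in key order; Column-A repeats one value in every row.
    static List<SimpleCell> sampleTable() {
        List<SimpleCell> cells = new ArrayList<>();
        for (int r = 1; r <= 3; r++) {
            cells.add(new SimpleCell("Row-" + r, "Column-A", "same"));
            cells.add(new SimpleCell("Row-" + r, "Column-B", "b" + r));
            cells.add(new SimpleCell("Row-" + r, "Column-C", "c" + r));
        }
        return cells;
    }

    // The same repeated value, but in a table with only Column-A.
    static List<SimpleCell> singleColumnTable() {
        List<SimpleCell> cells = new ArrayList<>();
        for (int r = 1; r <= 3; r++) {
            cells.add(new SimpleCell("Row-" + r, "Column-A", "same"));
        }
        return cells;
    }

    public static void main(String[] args) {
        System.out.println("multi-column neighbor duplicates:  "
                + countNeighborDuplicates(sampleTable()));
        System.out.println("single-column neighbor duplicates: "
                + countNeighborDuplicates(singleColumnTable()));
    }
}
```

In the multi-column table the repeated Column-A values are always separated by Column-B and Column-C cells, so the neighbor comparison never sees them side by side; in the single-column table they are adjacent.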

> To implement this we have to keep the previous value for each column. The
> most efficient way, in my opinion, is to
> keep them in a HashMap using ByteArrayWrapper for keys. The size of this
> map will be the same as the count of unique
> columns in the block being encoded.
So, what would you write out? Blocks that had this Map appended as metadata
or would this metadata be on the file itself?
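
A rough sketch of the proposed bookkeeping, under stated assumptions: the map is keyed by column qualifier, java.nio.ByteBuffer.wrap stands in for the ByteArrayWrapper mentioned in the quoted message (ByteBuffer already has content-based equals and hashCode), and the class and method names are hypothetical. It only tracks previous values in memory and reports the shared prefix length; it does not address what would be written to the block or the file.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class PerColumnPreviousValueSketch {

    // Previous value seen for each column qualifier within the current block.
    // ByteBuffer is an assumed stand-in for a byte[] wrapper key.
    private final Map<ByteBuffer, byte[]> previousByColumn = new HashMap<>();

    // Number of leading bytes that a and b have in common.
    static int commonPrefixLength(byte[] a, byte[] b) {
        int limit = Math.min(a.length, b.length);
        int i = 0;
        while (i < limit && a[i] == b[i]) {
            i++;
        }
        return i;
    }

    // Records value as the latest value for the column and returns the length of
    // the prefix it shares with the column's previous value, or -1 if the column
    // has not been seen before. A full duplicate is the case where the result
    // equals value.length and the previous value had the same length.
    int observe(byte[] qualifier, byte[] value) {
        // Copies keep the map safe if the caller reuses its arrays.
        ByteBuffer key = ByteBuffer.wrap(qualifier.clone());
        byte[] previous = previousByColumn.put(key, value.clone());
        return previous == null ? -1 : commonPrefixLength(previous, value);
    }

    // Convenience overload for the example below.
    int observe(String qualifier, String value) {
        return observe(qualifier.getBytes(StandardCharsets.UTF_8),
                value.getBytes(StandardCharsets.UTF_8));
    }

    // Number of distinct columns tracked so far.
    int trackedColumns() {
        return previousByColumn.size();
    }

    public static void main(String[] args) {
        PerColumnPreviousValueSketch sketch = new PerColumnPreviousValueSketch();
        System.out.println(sketch.observe("Column-A", "value-100")); // first sighting
        System.out.println(sketch.observe("Column-B", "other"));     // first sighting
        System.out.println(sketch.observe("Column-A", "value-100")); // same value again
        System.out.println(sketch.observe("Column-A", "value-101")); // shares a prefix
        System.out.println(sketch.trackedColumns());
    }
}
```

The map grows with the number of distinct columns in the block, as the quoted message says. The defensive copies in observe are themselves per-cell allocations, which is the kind of cost flagged at the top of this reply, so a real implementation would want to avoid them.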

> It looks very easy to implement, but I guess there must be some hidden
> obstacles, because this has not
> been implemented yet.
> What do you think about the idea? Is there a more efficient way (in
> CPU/memory) to keep previous values?
> Should I try to implement prefix delta encoding for values?
Any improvements to be had encoding/decoding would be much appreciated.

You have a dataset you can play with? A profiling workbench?

Thank you for asking,
