hbase-issues mailing list archives

From "dhruba borthakur (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5313) Restructure hfiles layout for better compression
Date Fri, 09 Mar 2012 19:25:00 GMT

    [ https://issues.apache.org/jira/browse/HBASE-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13226351#comment-13226351 ]

dhruba borthakur commented on HBASE-5313:
-----------------------------------------

I am guessing that, initially, we keep this new "columnar encoding" completely isolated inside
an HFileBlock. At table creation time, one can specify that the table be stored in
columnar-encoded fashion.

An HFile will have information in the FixedFileTrailer that specifies whether the data inside
the hfile is in columnar format. A single HFileBlock will have six "column-entities": all the
rowkeys will be laid out first, followed by all the cfs, followed by all the "column names",
followed by the timestamps, followed by the memstoreTS, followed by all the values.
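
To make the layout concrete, here is a minimal sketch of transposing a block's key-values into
one list per column-entity, in the order listed above. It is not HBase code: the KeyValue tuple,
the field names, and both helper functions are hypothetical and exist only for illustration.

```python
from typing import Dict, List, NamedTuple


class KeyValue(NamedTuple):
    """Hypothetical stand-in for one cell; not the real HBase KeyValue class."""
    row: bytes
    family: bytes
    qualifier: bytes
    timestamp: int
    memstore_ts: int
    value: bytes


# Order of the column-entities inside one block, as listed in the comment.
ENTITY_ORDER = ("row", "family", "qualifier", "timestamp", "memstore_ts", "value")


def to_columnar(kvs: List[KeyValue]) -> Dict[str, list]:
    """Transpose row-oriented key-values into one list per column-entity."""
    return {name: [getattr(kv, name) for kv in kvs] for name in ENTITY_ORDER}


def from_columnar(columns: Dict[str, list]) -> List[KeyValue]:
    """Rebuild the original key-values by zipping the entities back together."""
    return [KeyValue(*fields) for fields in zip(*(columns[n] for n in ENTITY_ORDER))]


if __name__ == "__main__":
    block = [
        KeyValue(b"row1", b"cf", b"a", 1331321100000, 7, b"v1"),
        KeyValue(b"row1", b"cf", b"b", 1331321100005, 8, b"v2"),
    ]
    columns = to_columnar(block)
    print(columns["row"])  # all rowkeys of the block, stored together
```

Because each entity holds values of one kind, similar bytes end up next to each other, which is
what the per-entity encoding and the block-level compression are meant to exploit.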

If 'prefix-encoding' is enabled, then each column-entity will be prefix encoded individually.
If compression (lzo, gz, etc) is enabled, the entire HFileBlock will be compressed accordingly.
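
As a rough illustration of prefix-encoding one column-entity on its own, the sketch below stores,
for each item, the number of leading bytes it shares with the previous item plus the remaining
suffix. The function names and the (shared length, suffix) representation are assumptions made
for this example; they are not the encoding HBase actually uses.

```python
from typing import List, Tuple


def shared_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_encode(items: List[bytes]) -> List[Tuple[int, bytes]]:
    """Encode each item as (bytes shared with the previous item, remaining suffix)."""
    encoded = []
    prev = b""
    for item in items:
        shared = shared_prefix_len(prev, item)
        encoded.append((shared, item[shared:]))
        prev = item
    return encoded


def prefix_decode(encoded: List[Tuple[int, bytes]]) -> List[bytes]:
    """Invert prefix_encode by re-attaching the shared prefix to every suffix."""
    items = []
    prev = b""
    for shared, suffix in encoded:
        item = prev[:shared] + suffix
        items.append(item)
        prev = item
    return items


if __name__ == "__main__":
    rowkeys = [b"user001", b"user002", b"user010"]
    print(prefix_encode(rowkeys))
```

Each entity would be encoded separately this way, and the general-purpose compressor (lzo, gz,
etc.) would then run over the whole encoded block.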

Prefix-encoding will work well for the rowkey entity and the column-family entity. The column
name entity may need to be sorted and then prefix encoded. The timestamp entity may need a special
kind of encoding. One option (suggested by a co-worker, Chip Turner) is to take the first timestamp
as the base, xor it with each of the following timestamps (thus zeroing out the common
bits), and store the results. Another option is to find the minimum timestamp in the block and
then store diffs from that minimum value. Yet another option is to use Jan-01-2012 as the
hbase-epoch and store each timestamp's difference from that epoch.
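
The three timestamp options can be compared with a small sketch like the one below. It assumes
timestamps are milliseconds since the Unix epoch, that a block holds at least one timestamp, and
that the "hbase-epoch" means midnight UTC on Jan-01-2012; all function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import List, Tuple

# Assumed "hbase-epoch": Jan-01-2012 00:00:00 UTC, in milliseconds since the Unix epoch.
HBASE_EPOCH_MS = int(datetime(2012, 1, 1, tzinfo=timezone.utc).timestamp()) * 1000


def xor_encode(timestamps: List[int]) -> Tuple[int, List[int]]:
    """Option 1: keep the first timestamp as base, xor it with each following one."""
    base = timestamps[0]
    return base, [t ^ base for t in timestamps[1:]]


def xor_decode(base: int, xored: List[int]) -> List[int]:
    """Xor is its own inverse, so applying the base again restores the timestamps."""
    return [base] + [x ^ base for x in xored]


def min_delta_encode(timestamps: List[int]) -> Tuple[int, List[int]]:
    """Option 2: store the block minimum plus each timestamp's non-negative diff."""
    low = min(timestamps)
    return low, [t - low for t in timestamps]


def min_delta_decode(low: int, deltas: List[int]) -> List[int]:
    return [low + d for d in deltas]


def epoch_delta_encode(timestamps: List[int]) -> List[int]:
    """Option 3: store each timestamp's diff from the fixed hbase-epoch."""
    return [t - HBASE_EPOCH_MS for t in timestamps]


def epoch_delta_decode(deltas: List[int]) -> List[int]:
    return [HBASE_EPOCH_MS + d for d in deltas]


if __name__ == "__main__":
    ts = [1331321100000, 1331321100005, 1331321099990]
    print(xor_encode(ts))
    print(min_delta_encode(ts))
    print(epoch_delta_encode(ts))
```

The first two options need one base value per block but produce small numbers when timestamps
within a block are close together; the fixed epoch needs no per-block state, but its diffs grow
as time moves away from the epoch.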

                
> Restructure hfiles layout for better compression
> ------------------------------------------------
>
>                 Key: HBASE-5313
>                 URL: https://issues.apache.org/jira/browse/HBASE-5313
>             Project: HBase
>          Issue Type: Improvement
>          Components: io
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>
> An HFile block contains a stream of key-values. Can we organize these kvs on the disk
> in a better way so that we get much greater compression ratios?
> One option (thanks Prakash) is to store all the keys in the beginning of the block (let's
> call this the key-section) and then store all their corresponding values towards the end of
> the block. This will allow us to avoid decompressing the values at all when we are scanning
> and skipping over rows in the block.
> Any other ideas? 
