hbase-issues mailing list archives

From "He Yongqiang (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5313) Restructure hfiles layout for better compression
Date Sat, 11 Feb 2012 00:01:06 GMT

    [ https://issues.apache.org/jira/browse/HBASE-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205892#comment-13205892 ]

He Yongqiang commented on HBASE-5313:
-------------------------------------

@Todd, with such a small block size and the data already sorted, I was also thinking it
would be very hard to optimize the space.

So we did some experiments by modifying today's HFileWriter. It turns out we can still save
a lot of space if we play a few more tricks.

Here are test results (block size is 16KB):

*42MB HFile, with Delta compression and with LZO compression* (with default setting on Apache
trunk)

*30MB HFile, with Columnar, with Delta compression, and with LZO compression.*

Inside one block, first put all row keys for that block, apply delta compression, and
then LZO compression. After the row keys, put all column family data for that block and apply
Delta+LZO to it. Then do the same for column_qualifier, and so on for the remaining columns.
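
To make the columnar layout above concrete, here is a minimal sketch in Python. It is not the
modified HFileWriter from the experiment. It assumes each key-value is a tuple of byte strings,
uses a simple shared-prefix delta as the "Delta" step, and uses zlib as a stand-in for LZO
(LZO is not in the Python standard library). All function names are hypothetical.

```python
import struct
import zlib


def prefix_delta_encode(items):
    """Store each entry as (shared-prefix length, suffix length, suffix)
    relative to the previous entry. Assumes entries are shorter than 64KB."""
    out = bytearray()
    prev = b""
    for item in items:
        shared = 0
        limit = min(len(prev), len(item))
        while shared < limit and prev[shared] == item[shared]:
            shared += 1
        suffix = item[shared:]
        out += struct.pack(">HH", shared, len(suffix)) + suffix
        prev = item
    return bytes(out)


def prefix_delta_decode(data):
    """Inverse of prefix_delta_encode."""
    items = []
    prev = b""
    pos = 0
    while pos < len(data):
        shared, suffix_len = struct.unpack_from(">HH", data, pos)
        pos += 4
        item = prev[:shared] + data[pos:pos + suffix_len]
        pos += suffix_len
        items.append(item)
        prev = item
    return items


def encode_columnar_block(kvs):
    """Split the block into one section per column (row key, family,
    qualifier, ...), then delta-encode and compress each section on its own.
    zlib is an assumption standing in for LZO."""
    columns = zip(*kvs)
    return [zlib.compress(prefix_delta_encode(col)) for col in columns]


def decode_columnar_block(sections):
    """Rebuild the original key-value tuples from the per-column sections."""
    columns = [prefix_delta_decode(zlib.decompress(s)) for s in sections]
    return list(zip(*columns))
```

Compressing each column as its own section means the compressor sees runs of similar data
(all row keys, then all families, and so on) instead of interleaved fields.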

*24MB HFile, with Columnar, Sort value column, Sort column_qualifier column, and with LZO
compression.*

Inside one block, first put all row keys for that block, apply delta compression, and
then LZO compression. After the row keys, put all column family data for that block and apply
Delta+LZO to it. Then put column_qualifier, sort it, and apply Delta+LZO. The TS column and
Code column are processed the same way as column family. The value column is processed the
same way as column_qualifier. So the disk format is the same as for the 30MB HFile, except
that all data for 'column_qualifier' and 'value' is sorted separately.
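
The comment does not say how the original order of a separately sorted column is recovered
on read. As one possible approach, and purely an assumption, the sketch below keeps a
permutation index next to the sorted column so the column can be put back in block order.
The function names are hypothetical.

```python
def sort_column(values):
    """Sort one column on its own and return the sorted values plus the
    original position of each sorted entry (assumed bookkeeping, not
    something described in the comment)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    sorted_values = [values[i] for i in order]
    return sorted_values, order


def restore_column(sorted_values, order):
    """Put a sorted column back into its original block order."""
    restored = [None] * len(order)
    for pos, original_index in enumerate(order):
        restored[original_index] = sorted_values[pos]
    return restored
```

Sorting places equal and near-equal entries next to each other, which is what lets the delta
step shrink them. Any real saving has to be weighed against the cost of storing the
permutation index.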

Out of the 24MB file, 6MB is used to store row keys, 7MB is used to store column_qualifier,
and 6MB is used to store values.

More ideas are welcome! 

                
> Restructure hfiles layout for better compression
> ------------------------------------------------
>
>                 Key: HBASE-5313
>                 URL: https://issues.apache.org/jira/browse/HBASE-5313
>             Project: HBase
>          Issue Type: Improvement
>          Components: io
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>
> An HFile block contains a stream of key-values. Can we organize these kvs on the disk
> in a better way so that we get much greater compression ratios?
> One option (thanks Prakash) is to store all the keys in the beginning of the block (let's
> call this the key-section) and then store all their corresponding values towards the end of
> the block. This will allow us to avoid decompressing the values at all when we are scanning
> and skipping over rows in the block.
> Any other ideas? 
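
A minimal Python sketch of the key-section / value-section idea from the issue description
above. It assumes the two sections are compressed independently, which is what allows a scan
to skip the values. zlib stands in for the real codec, and the SplitBlock class with its
methods is hypothetical, not an HBase API.

```python
import struct
import zlib


def _pack(items):
    """Length-prefix each byte string and concatenate them."""
    return b"".join(struct.pack(">I", len(x)) + x for x in items)


def _unpack(data):
    """Inverse of _pack."""
    items, pos = [], 0
    while pos < len(data):
        (n,) = struct.unpack_from(">I", data, pos)
        pos += 4
        items.append(data[pos:pos + n])
        pos += n
    return items


class SplitBlock:
    """One block with a compressed key section followed by a separately
    compressed value section."""

    def __init__(self, kvs):
        self.key_section = zlib.compress(_pack([k for k, _ in kvs]))
        self.value_section = zlib.compress(_pack([v for _, v in kvs]))
        self.value_decompressions = 0  # counter for illustration only

    def keys(self):
        return _unpack(zlib.decompress(self.key_section))

    def contains(self, key):
        # Only the key section is decompressed here.
        return key in self.keys()

    def get(self, key):
        keys = self.keys()
        if key not in keys:
            return None  # skipped the block without touching the values
        self.value_decompressions += 1
        values = _unpack(zlib.decompress(self.value_section))
        return values[keys.index(key)]
```

A scan that is skipping rows only calls contains() or keys(), so the value section stays
compressed until a matching key is actually read.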

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
