hbase-issues mailing list archives

From "stack (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
Date Mon, 26 Mar 2012 18:18:29 GMT

    [ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238645#comment-13238645 ]

stack commented on HBASE-4676:
------------------------------

bq. The KeyValueScanner and related classes that iterate through the trie will eventually
need to be smarter and have methods to do things like skipping to the next row of results
without scanning every cell in between.

Do we not have this now Matt?

As said before, 'So 40-80x faster seeks at medium block size' is a pretty nice improvement.
Ditto on 'Seek speed is ~15x NONE seek speed while expanding the effective block cache size by 4x.'

What are your thoughts on these times, Matt?  The slow write and scan speeds would tend to rule
it out for anything but a random-read workload, but when random reading only, you'll want to
use prefixtrie (smile).  Do you think it possible to approach prefix compression scan and
write speeds?

Regarding your working-set read testing, did it all fit in memory?

How did you measure cycles per seek?

What is numBlocks?  The total number of blocks the dataset fit in?

Throughput is better for sure... what is latency like?

Excellent stuff Matt.




                
> Prefix Compression - Trie data block encoding
> ---------------------------------------------
>
>                 Key: HBASE-4676
>                 URL: https://issues.apache.org/jira/browse/HBASE-4676
>             Project: HBase
>          Issue Type: New Feature
>          Components: io, performance, regionserver
>    Affects Versions: 0.90.6
>            Reporter: Matt Corgan
>         Attachments: PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png
>
>
> The HBase data block format has room for 2 significant improvements for applications that have high block cache hit ratios.
> First, there is no prefix compression, and the current KeyValue format is somewhat metadata heavy, so there can be tremendous memory bloat for many common data layouts, specifically those with long keys and short values.
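
To make the bloat concrete, a back-of-the-envelope sketch (the cell dimensions below are invented for illustration; the byte layout approximates the flat KeyValue serialization):

    // KeyValueBloat.java -- illustrative only, not code from the patch: the flat
    // KeyValue layout repeats the full row key and fixed metadata with every cell.
    public class KeyValueBloat {
        public static void main(String[] args) {
            // assumed example cell: 40-byte row key, 1-byte family, 10-byte qualifier, 4-byte value
            int rowLen = 40, familyLen = 1, qualifierLen = 10, valueLen = 4;
            // approximate flat cost: 4B key length + 4B value length + 2B row length + row
            //   + 1B family length + family + qualifier + 8B timestamp + 1B type + value
            int flatBytes = 4 + 4 + 2 + rowLen + 1 + familyLen + qualifierLen + 8 + 1 + valueLen;
            System.out.println("bytes per cell : " + flatBytes);   // 75
            System.out.println("payload bytes  : " + valueLen);    // 4
            // every additional cell in the same row repeats the 40-byte row key and most of
            // the metadata, which is what the row trie is meant to eliminate
        }
    }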
> Second, there is no random access to KeyValues inside data blocks.  This means that every time you double the datablock size, average seek time (or average cpu consumption) goes up by a factor of 2.  The standard 64KB block size is ~10x slower for random seeks than a 4KB block size, but block sizes as small as 4KB cause problems elsewhere.  Using block sizes of 256KB or 1MB or more may be more efficient from a disk access and block-cache perspective in many big-data applications, but doing so is infeasible from a random seek perspective.
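
The linear-versus-logarithmic argument above can be sketched numerically (the average cell size is an assumption, purely for illustration):

    // SeekCost.java -- rough illustration: with only sequential access inside a block,
    // expected seek cost grows linearly with block size, while fixed-width entries that
    // permit binary search grow roughly with log2 of the cell count.
    public class SeekCost {
        public static void main(String[] args) {
            int avgCellBytes = 100;                        // assumed average cell size
            for (int blockKb : new int[] {4, 16, 64, 256, 1024}) {
                int cells = blockKb * 1024 / avgCellBytes;
                double sequentialSteps = cells / 2.0;      // expected cells touched by a linear scan
                long binarySteps = Math.round(Math.ceil(Math.log(cells) / Math.log(2)));
                System.out.printf("%5d KB block: ~%6.0f sequential steps vs ~%2d binary-search steps%n",
                        blockKb, sequentialSteps, binarySteps);
            }
        }
    }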
> The PrefixTrie block encoding format attempts to solve both of these problems.  Some features:
> * trie format for row key encoding completely eliminates duplicate row keys and encodes similar row keys into a standard trie structure, which also saves a lot of space
> * the column family is currently stored once at the beginning of each block.  this could easily be modified to allow multiple family names per block
> * all qualifiers in the block are stored in their own trie format, which caters nicely to wide rows.  duplicate qualifiers between rows are eliminated.  the size of this trie determines the width of the block's qualifier fixed-width-int
> * the minimum timestamp is stored at the beginning of the block, and deltas are calculated from that.  the maximum delta determines the width of the block's timestamp fixed-width-int (see the sketch after this list)
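
As a concrete illustration of the last bullet, a minimal sketch of the min-timestamp-plus-fixed-width-delta idea (class name and values are invented for the example, not taken from the hbase-prefix-trie code):

    // TimestampDeltaSketch.java -- illustrative sketch: store the block's minimum
    // timestamp once, then per-cell deltas in a fixed width derived from the max delta.
    import java.util.Arrays;

    public class TimestampDeltaSketch {
        public static void main(String[] args) {
            long[] timestamps = {1332780000000L, 1332780004250L, 1332780000900L};
            long min = Arrays.stream(timestamps).min().getAsLong();
            long maxDelta = Arrays.stream(timestamps).map(t -> t - min).max().getAsLong();
            // width of the block's fixed-width timestamp int, in bytes;
            // 0 would mean every timestamp equals min, i.e. zero per-cell overhead
            int widthBytes = (64 - Long.numberOfLeadingZeros(maxDelta) + 7) / 8;
            System.out.println("block min timestamp = " + min + ", delta width = " + widthBytes + " bytes");
            for (long t : timestamps) {
                System.out.println("  cell delta " + (t - min) + " stored in " + widthBytes + " bytes");
            }
        }
    }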
> The block is structured with metadata at the beginning, then a section for the row trie, then the column trie, then the timestamp deltas, and then all the values.  Most work is done in the row trie, where every leaf node (corresponding to a row) contains a list of offsets/references corresponding to the cells in that row.  Each cell is fixed-width to enable binary searching and is represented by [1 byte operationType, X bytes qualifier offset, X bytes timestamp delta offset].
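
A sketch of what the fixed-width cell entries buy: since every entry in a row's cell list has the same width, the i-th entry is directly addressable and the list can be binary searched.  The widths, method names, and the assumption that qualifier offsets are assigned in sorted qualifier order are all simplifications for the example, not details from the patch:

    import java.nio.ByteBuffer;

    // FixedWidthCellSearch.java -- illustrative only.
    public class FixedWidthCellSearch {
        /** entryWidth = 1 (operation type) + qualifierOffsetWidth + timestampOffsetWidth. */
        static int findCell(ByteBuffer block, int firstEntryPos, int numCells, int entryWidth,
                            int qualifierOffsetWidth, int targetQualifierOffset) {
            int lo = 0, hi = numCells - 1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                // fixed width means entry `mid` can be addressed directly, no scanning
                int entryPos = firstEntryPos + mid * entryWidth;
                int qualOffset = readFixedWidthInt(block, entryPos + 1, qualifierOffsetWidth);
                if (qualOffset < targetQualifierOffset) {
                    lo = mid + 1;
                } else if (qualOffset > targetQualifierOffset) {
                    hi = mid - 1;
                } else {
                    return mid;
                }
            }
            return -(lo + 1);  // not found; encodes the insertion point, like Arrays.binarySearch
        }

        static int readFixedWidthInt(ByteBuffer buf, int pos, int width) {
            int value = 0;
            for (int i = 0; i < width; i++) {
                value = (value << 8) | (buf.get(pos + i) & 0xFF);
            }
            return value;
        }
    }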
> If all operation types are the same for a block, there will be zero per-cell overhead.  Same for timestamps.  Same for qualifiers when I get a chance.  So, the compression aspect is very strong, but makes a few small sacrifices on VarInt size to enable faster binary searches in trie fan-out nodes.
> A more compressed but slower version might build on this by also applying further (suffix, etc) compression on the trie nodes at the cost of slower write speed.  Even further compression could be obtained by using all VInts instead of FInts, with a sacrifice on random seek speed (though not huge).
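
A sketch of the FInt/VInt tradeoff being referred to: a fixed-width int can be read at a position computed up front (binary-search friendly), while a variable-length int must be decoded byte by byte, so entry positions stop being directly addressable.  The varint below is the protobuf-style encoding, used here only as an illustration:

    import java.nio.ByteBuffer;

    // FintVsVint.java -- illustrative only.
    public class FintVsVint {
        static int readFint(ByteBuffer buf, int pos, int width) {
            int value = 0;
            for (int i = 0; i < width; i++) {
                value = (value << 8) | (buf.get(pos + i) & 0xFF);   // position known in advance
            }
            return value;
        }

        static long readVint(ByteBuffer buf, int pos) {
            long value = 0;
            int shift = 0;
            byte b;
            do {
                b = buf.get(pos++);
                value |= (long) (b & 0x7F) << shift;   // 7 payload bits per byte
                shift += 7;
            } while ((b & 0x80) != 0);                 // continuation bit set: keep reading
            return value;                              // length only known after decoding
        }
    }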
> One current drawback is write speed.  While programmed with good constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not programmed with the same level of optimization as the read path.  Work will need to be done to optimize the data structures used for encoding, which could probably yield a 10x increase.  It will still be slower than delta encoding, but with a much higher decode speed.  I have not yet created a thorough benchmark for write speed or sequential read speed.
> Though the trie is reaching a point where it is internally very efficient (probably within half or a quarter of its max read speed), the way that hbase currently uses it is far from optimal.  The KeyValueScanner and related classes that iterate through the trie will eventually need to be smarter and have methods to do things like skipping to the next row of results without scanning every cell in between.  When that is accomplished it will also allow much faster compactions, because the full row key will not have to be compared as often as it is now.
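
A purely hypothetical sketch of the kind of scanner method described above; the interface and method names are invented to illustrate "skip to the next row without scanning every cell in between" and are not part of the patch or of KeyValueScanner:

    public interface RowSkippingScanner {
        /**
         * Position the scanner at the first cell of the next row, ideally by jumping
         * via the row trie's next leaf rather than iterating over the remaining cells
         * of the current row.
         *
         * @return false when there are no further rows
         */
        boolean reseekToNextRow();
    }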
> Current code is on github.  The trie code is in a separate project from the slightly modified hbase.  There is an hbase project there as well with the DeltaEncoding patch applied, and the trie code builds on top of that.
> https://github.com/hotpads/hbase/tree/prefix-trie-1
> https://github.com/hotpads/hbase-prefix-trie/tree/hcell-scanners
> I'll follow up later with more implementation ideas.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
