hbase-dev mailing list archives

From Jonathan Gray <jg...@fb.com>
Subject RE: prefix compression
Date Thu, 02 Jun 2011 23:44:07 GMT
I'm here!

Still parsing through the stuff in this e-mail, but I agree that many approaches will
require significant rejiggering of KeyValue.  I have made some attempts at this, but
nothing is very far along.  It's definitely something that we are hoping to put some
resources into over the summer.

JG

> -----Original Message-----
> From: Todd Lipcon [mailto:todd@cloudera.com]
> Sent: Thursday, June 02, 2011 3:37 PM
> To: dev@hbase.apache.org
> Subject: Re: prefix compression
> 
> Hey Matt,
> 
> Interesting email, and also something I've been thinking about recently.
> 
> Unfortunately, I think one of the big prerequisites before we can start
> thinking about actual compression algorithms is some refactoring around
> how KeyValue is used. Currently, KeyValue exposes byte arrays in a lot of
> places
> - these byte arrays are expected to have discrete sub-ranges that represent
> row key, column qualifier, etc. Various parts of the code directly grab these
> byte arrays and offsets and do stuff with them (eg comparison, serialization,
> etc).
> 
> In order to split the different components up, or do true prefix compression,
> we'd need to be able to have KeyValue do some smarter things for
> comparison, and never call functions like "getBuffer(), getRowKeyOffset(),
> getRowKeyLength()" expecting them to expose a byte array range with a row
> key in it.
> 
> I think Jonathan Gray has some code hacked together for this, but last I
> heard it wasn't in shippable state... Jonathan, you out there?
> 
> As for the particular compression algorithms, a nice first step would be a very
> simplistic one: the first row key of any HFile block would be given in full.
> Following that, each key uses a single-byte token that identifies the
> number of key components shared with the previous key (ie same row,
> same col, different ts). This isn't quite as good as true prefix compression,
> but for common cases where each cell has three versions, and each row has
> many columns, it's probably a huge savings.
> 
> -Todd
> 
> On Thu, Jun 2, 2011 at 3:28 PM, Matt Corgan <mcorgan@hotpads.com>
> wrote:
> 
> > I've been reading up on prefix compression for a few weeks trying to
> > figure out the best approach.  Maybe some of you have noticed that 90%
> > of my emails to the user list have something to do with it...  Now
> > that people are working on HFile v2 and the topic of Lucene's FST has
> > come up, I thought it would be a good time to share some thoughts.  My
> > goal here is definitely not asking anyone to implement this, but just
> > to start a discussion that gets the proper jiras in place.  I have
> > spent a little time coding an initial rowKey trie builder that could
> > be the start of a pluggable implementation.
> >
> > This is a big topic to address in one email, but I don't think the
> > pieces make sense on their own.  Here's an outline with some thoughts:
> >
> > * refer to prefix compression as "compaction" to avoid confusion
> > with traditional "compression"
> >
> > * main goal is to reduce size of data in memory to increase effective
> > capacity of block index, block cache, and memstores
> >    * disk access takes thousands of times longer than memory access,
> > especially over hdfs/network
> >
> > * secondary goal is to speed up the processing of in memory data
> >    * memory access takes 200+ cpu cycles, so we must use cache-friendly
> > algorithms
> >    * structure trie node sizes around typical 64B cache line size
> >    * investigate java's memory prefetching capabilities, possibly
> > increasing to ~256B nodes
> >        * research paper
> > <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.68.4872&rep=rep1&type=pdf>
> > about modern memory databases and prefetching
> >    * almost always use variable length integer/long encoding
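> >
> > (for illustration, a minimal varint sketch; this 7-bits-per-byte
> > scheme is just one common option, not a specific proposal:)
> >
> >     import java.io.ByteArrayOutputStream;
> >
> >     public class VarInt {
> >       // 7 payload bits per byte; high bit set on all but the last
> >       public static void write(ByteArrayOutputStream out, long v) {
> >         while ((v & ~0x7FL) != 0) {
> >           out.write((int) ((v & 0x7F) | 0x80));
> >           v >>>= 7;
> >         }
> >         out.write((int) v);
> >       }
> >
> >       public static long read(byte[] buf, int off) {
> >         long v = 0;
> >         for (int shift = 0; ; shift += 7, off++) {
> >           byte b = buf[off];
> >           v |= (long) (b & 0x7F) << shift;
> >           if ((b & 0x80) == 0) return v;
> >         }
> >       }
> >     }
> >
> > (small trie offsets then cost 1-2 bytes instead of 4 or 8)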
> >
> > * need separate compacted structures/algorithms for rowKey,
> > colQualifier, and timestamp
> >    * structures will need to interact with varInt pointers to link
> > rowKey<->colQualifier<->timestamp
> >
> > * I think in the medium to long term (and maybe even short term) it
> > would be best to code this from scratch
> >
> >
> > Static row key compaction for block cache and data blocks:
> > * avoid schemes where each row is a diff of the previous (expensive
> > cpu)
> > * utilize the fact that row keys are always sorted (unlike column
> > qualifiers)
> >    * this makes for fast tree construction during flushing and
> > compaction
> > * low hanging fruit: (rowKeyCompaction=simple; rough sketch after this
> > list)
> >    * pull off the common prefix of all rows
> >    * have a flag indicating if the row key is a repeat of the previous
> > * more aggressive: use modified Cache Sensitive Search Trees
> > <http://www.cs.columbia.edu/~library/TR-repository/reports/reports-1998/cucs-019-98.pdf>
> > (rowKeyCompaction=cssTree64, cssTree256, etc)
> >    * modified to accept full byte range 0-255, store full byte strings
> > in the tree, remove all pointers, optimize for cache line size, etc,
> > but similar idea
> >    * CSS trees are considered sub-par because they are not easily
> > modified, but that is ok in hbase's block cache and data blocks
> > * input is a sorted list of byte[], output is a byte[] containing the
> > entire prefix trie
> > * possibly put all keys in the front of the block, and all values at
> > the end, with value offsets/lengths in the keys
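> >
> > (the rough sketch promised above, for rowKeyCompaction=simple; the
> > layout and names are hypothetical:)
> >
> >     import java.io.ByteArrayOutputStream;
> >     import java.io.DataOutputStream;
> >     import java.io.IOException;
> >     import java.util.Arrays;
> >     import java.util.List;
> >
> >     public class SimpleRowKeyBlock {
> >       public static byte[] compact(List<byte[]> sortedRows)
> >           throws IOException {
> >         // rows are sorted, so the prefix common to all of them is
> >         // the prefix common to the first and last
> >         byte[] first = sortedRows.get(0);
> >         byte[] last = sortedRows.get(sortedRows.size() - 1);
> >         int p = 0;
> >         while (p < first.length && p < last.length
> >             && first[p] == last[p]) p++;
> >
> >         ByteArrayOutputStream buf = new ByteArrayOutputStream();
> >         DataOutputStream out = new DataOutputStream(buf);
> >         out.writeShort(p);
> >         out.write(first, 0, p);         // shared prefix, stored once
> >         byte[] prev = null;
> >         for (byte[] row : sortedRows) {
> >           if (prev != null && Arrays.equals(row, prev)) {
> >             out.writeBoolean(true);     // repeat of previous row
> >           } else {
> >             out.writeBoolean(false);    // new row: unshared suffix
> >             out.writeShort(row.length - p);
> >             out.write(row, p, row.length - p);
> >           }
> >           prev = row;
> >         }
> >         return buf.toByteArray();
> >       }
> >     }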
> >
> > Backwards compatibility
> > * block cache could quickly compute the trie when loaded, therefore
> > working with HFile v1
> >    * eventually want to store the computed trie on disk
> > * HFile v2 will need to store information about the pluggable data
> > block compaction schemes
> >
> >
> > Memstore dynamic row key compaction:
> > * this will be much more difficult than the static compaction because
> > it needs to accept unsorted byte[]'s and support row locking
> > * use modified C-Burstsort
> > <http://ww2.cs.mu.oz.au/~rsinha/papers/SinhaRingZobel-2006.pdf> for
> > dynamic unsorted data
> > * fast and compact (best dynamic string sorting algorithm I could find)
> > * fragmentation friendly (not original goal, but ends up working like
> > MSLAB)
> >    * allocates large (customizable) byte[] for holding multiple leaf values
> >    * "bursts" the byte[] when they fill up, creating multiple child
> > arrays
> > * will need to add some sort of lockable object to the leaf of the
> > rowKey trie
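> >
> > (a heavily simplified sketch of the burst mechanics; a real version
> > would pack key suffixes into contiguous byte[]s instead of a List,
> > which is where the MSLAB-like allocation behavior comes from:)
> >
> >     import java.util.ArrayList;
> >     import java.util.List;
> >
> >     public class BurstNode {
> >       static final int BUCKET_LIMIT = 32;  // tune toward cache sizes
> >
> >       BurstNode[] children;                // non-null once burst
> >       List<byte[]> bucket = new ArrayList<byte[]>();
> >       int endsHere;                        // keys ending at this depth
> >
> >       void insert(byte[] key, int depth) {
> >         if (children != null) {            // already burst: descend
> >           if (depth == key.length) { endsHere++; return; }
> >           child(key[depth]).insert(key, depth + 1);
> >         } else {
> >           bucket.add(key);
> >           if (bucket.size() > BUCKET_LIMIT) burst(depth);
> >         }
> >       }
> >
> >       // redistribute the bucket into children by the next key byte
> >       private void burst(int depth) {
> >         children = new BurstNode[256];
> >         List<byte[]> old = bucket;
> >         bucket = null;
> >         for (byte[] k : old) {
> >           if (k.length == depth) { endsHere++; continue; }
> >           child(k[depth]).insert(k, depth + 1);
> >         }
> >       }
> >
> >       private BurstNode child(byte b) {
> >         int i = b & 0xFF;
> >         if (children[i] == null) children[i] = new BurstNode();
> >         return children[i];
> >       }
> >     }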
> >
> >
> > Column qualifier compaction:
> > * config param colQualifierCompaction=(none, lookupTable, cssTree64,
> > etc)
> > * use lookupTable when column qualifiers are a small bounded set
> >    * store the lookup table in memory (similar to blockIndex or
> > fixedFileTrailer)
> >    * on second thought, this could be difficult without 2-pass
> > compaction, but maybe still possible
> > * use cssTree when a large or unbounded set, like when appending a
> > numerical suffix in wide rows
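> >
> > (sketch of the lookupTable idea, names hypothetical; building the
> > table incrementally like this is one-pass: assign indexes as
> > qualifiers first appear, then write the table out after the data,
> > similar to how the block index is handled, which might sidestep the
> > 2-pass concern above:)
> >
> >     import java.nio.charset.Charset;
> >     import java.util.ArrayList;
> >     import java.util.HashMap;
> >     import java.util.List;
> >     import java.util.Map;
> >
> >     public class QualifierDictionary {
> >       // ISO-8859-1 round-trips all byte values, so a String is a
> >       // cheap way to key a map on byte[] contents
> >       private static final Charset ISO = Charset.forName("ISO-8859-1");
> >       private final Map<String, Integer> index =
> >           new HashMap<String, Integer>();
> >       private final List<byte[]> table = new ArrayList<byte[]>();
> >
> >       // returns the small int written in place of the qualifier
> >       public int encode(byte[] qual) {
> >         String k = new String(qual, ISO);
> >         Integer i = index.get(k);
> >         if (i == null) {
> >           i = table.size();
> >           index.put(k, i);
> >           table.add(qual.clone());   // table is stored with the file
> >         }
> >         return i;                    // encode as a varint
> >       }
> >
> >       public byte[] decode(int i) { return table.get(i); }
> >     }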
> >
> >
> > Timestamp compaction:
> > * timestampCompaction=(none, hfileDiff, cssTree, flatten)
> > * each hfile needs a lowestTimestamp stored in the trailer
> > * hfileDiff would store a varLong relative to the lowestTimestamp
> > (tiny sketch at the end of this section)
> > * cssTree would do the same as it does for rowKeys, but the trie
> > overhead might wipe out the space savings with such small values
> > * flatten would not store a timestamp in each record, rather use the
> > hfile's lowestTimestamp for every record
> >    * many use cases don't care about the timestamp
> >    * I think this works for maxVersions=1.  May need a more elaborate
> > scheme for multiple versions
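> >
> > (hfileDiff in miniature, reusing the VarInt sketch from earlier;
> > names hypothetical:)
> >
> >     import java.io.ByteArrayOutputStream;
> >
> >     public class TimestampDiff {
> >       // each cell stores a varlong delta from the trailer's
> >       // lowestTimestamp; recently-written files then spend 1-2
> >       // bytes per timestamp instead of 8
> >       public static void write(ByteArrayOutputStream out, long ts,
> >           long lowestTimestamp) {
> >         VarInt.write(out, ts - lowestTimestamp);
> >       }
> >
> >       public static long read(byte[] buf, int off,
> >           long lowestTimestamp) {
> >         return lowestTimestamp + VarInt.read(buf, off);
> >       }
> >     }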
> >
> >
> > I'll stop there...  sorry for the huge email!  I haven't seen much
> > discussion on email or jira about how prefix compression should be
> > done, so I hope this provides somewhat of a starting point.
> >
> > Matt
> >
> 
> 
> 
> --
> Todd Lipcon
> Software Engineer, Cloudera
