accumulo-user mailing list archives

From William Slacum <>
Subject Re: compression of keys for a sequential scan over an inverted index
Date Mon, 26 Oct 2015 14:55:24 GMT
Thanks, Jonathan! I've wondered about specific numbers on this topic when
dealing with geohashes, so this is a very useful tool.

On Sun, Oct 25, 2015 at 11:22 AM, Jonathan Wonders <> wrote:

> I have been able to put some more thought into this over the weekend and
> make initial observations on tables I currently have populated.  Looking at
> the rfile-info for a few different tables, I noticed that one which has
> particularly small lexicographical deltas between keys costs an average of
> ~2.5 bits per key to store on disk.  All of the data is stored in the row
> component of the key and a full row is typically about 36 bytes.  I wrote a
> little utility to recreate ScanResult objects for batches of sequential
> key-value pairs returned from a scanner and then used the TCompactProtocol
> to write the ScanResult to a byte array.  Each key-value pair costs
> roughly 48 bytes, which makes sense given that every row is different and
> there will be some space required for the timestamps, visibilities, and
> other bookkeeping info.
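> The core of the utility is roughly the following (a sketch only; the class and
> method names here are made up, and it assumes the 1.6/1.7 client API and the
> thrift-generated ScanResult/TKeyValue classes in org.apache.accumulo.core.data.thrift):
>
> import java.util.ArrayList;
> import java.util.List;
> import java.util.Map.Entry;
> import org.apache.accumulo.core.client.Scanner;
> import org.apache.accumulo.core.data.Key;
> import org.apache.accumulo.core.data.KeyValue;
> import org.apache.accumulo.core.data.Value;
> import org.apache.accumulo.core.data.thrift.ScanResult;
> import org.apache.thrift.TException;
> import org.apache.thrift.TSerializer;
> import org.apache.thrift.protocol.TCompactProtocol;
>
> public class ScanResultSizer {
>   // Average wire bytes per key-value pair for one batch pulled from a scanner.
>   static double bytesPerEntry(Scanner scanner, int batchSize) throws TException {
>     List<KeyValue> batch = new ArrayList<KeyValue>();
>     for (Entry<Key,Value> e : scanner) {
>       batch.add(new KeyValue(e.getKey(), e.getValue().get()));
>       if (batch.size() >= batchSize)
>         break;
>     }
>     // Key.compress is the default strategy: it only elides key fields that are
>     // byte-for-byte identical to the previous entry in the batch.
>     ScanResult result = new ScanResult(Key.compress(batch), false);
>     byte[] wire = new TSerializer(new TCompactProtocol.Factory()).serialize(result);
>     return (double) wire.length / batch.size();
>   }
> }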
> Another table I looked at has larger lexicographical deltas between keys
> and costs roughly 5 bytes per key to store on disk.  This table is a
> reverse index with very large rows; each column within a row identifies
> data that resides in another table.  Each column is roughly 12 bytes
> uncompressed.  When encoded in a ScanResult, each key-value pair costs
> roughly 25 bytes which makes sense since the row cost should be negligible
> for large batch sizes and the overhead from timestamp, visibility, and
> other bookkeeping info is roughly the same as for the other table.
> Since compression depends heavily on both table design and the actual
> data, it seemed the next logical step would be to create a tool that the
> community could use to easily measure the compression ratio for ScanResult objects.
> So, I threw together a shell extension to wrap the utility that I
> previously described.  It measures compression ratio for the default
> strategy Key.compress as well as a few other simple strategies that seemed
> reasonable to test. The usage is almost the same as the scan command; it
> just prints out compression statistics rather than the data.
> It lives at with
> branches for Accumulo 1.6.x, 1.7.x, and 1.8.x.
> Any feedback is welcome.  I hope others find this useful for understanding
> this particular aspect of scan performance.
> V/R
> Jonathan
> On Thu, Oct 22, 2015 at 4:37 PM, Jonathan Wonders <>
> wrote:
>> Josh,
>> Thanks for the information.  I did read through the discussion about
>> compression of visibility expressions and columns within RFiles a while
>> back which got me thinking about some of this. It makes sense that gzip or
>> lzo/snappy compression would have a very noticeable impact when there are
>> columns or visibility expressions that are not compressed with RLE even if
>> neighboring rows have very small lexicographical deltas.
>> I will put some thought into designing an experiment to evaluate whether
>> or not there is any benefit to applying RLE during key-value transport from
>> server to client.  Even if it proves to be situationally beneficial, I
>> think it could be implemented as a common iterator similar to the
>> WholeRowIterator.
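>> As a rough sketch of the kind of encoding I have in mind (gzip standing in for
>> whatever codec or RLE scheme actually gets tested, and reusing the existing
>> WholeRowIterator row encoding rather than inventing a new one):
>>
>> import java.io.ByteArrayInputStream;
>> import java.io.ByteArrayOutputStream;
>> import java.io.IOException;
>> import java.util.List;
>> import java.util.SortedMap;
>> import java.util.zip.GZIPInputStream;
>> import java.util.zip.GZIPOutputStream;
>> import org.apache.accumulo.core.data.Key;
>> import org.apache.accumulo.core.data.Value;
>> import org.apache.accumulo.core.iterators.user.WholeRowIterator;
>>
>> public class CompressedRowCodec {
>>   // Server side: pack one row's entries into a single Value, then compress it.
>>   public static Value encode(List<Key> keys, List<Value> values) throws IOException {
>>     Value packed = WholeRowIterator.encodeRow(keys, values);
>>     ByteArrayOutputStream bos = new ByteArrayOutputStream();
>>     try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
>>       gz.write(packed.get());
>>     }
>>     return new Value(bos.toByteArray());
>>   }
>>
>>   // Client side: decompress and re-expand into the original per-column entries.
>>   public static SortedMap<Key,Value> decode(Key rowKey, Value compressed) throws IOException {
>>     ByteArrayOutputStream bos = new ByteArrayOutputStream();
>>     try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed.get()))) {
>>       byte[] buf = new byte[4096];
>>       int n;
>>       while ((n = gz.read(buf)) != -1)
>>         bos.write(buf, 0, n);
>>     }
>>     return WholeRowIterator.decodeRow(rowKey, new Value(bos.toByteArray()));
>>   }
>> }
>>
>> The server-side half would sit in a SortedKeyValueIterator, much like the
>> WholeRowIterator does today, and the client would call decode on each returned
>> entry.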
>> Given the current compression strategy, I would expect better
>> server-client transport compression retrieving a single row with many
>> columns
>> <key><value>        : <id>    []
>> compared to many lexicographically close rows.
>> <key><value><id>    :         []
>> with the understanding that very large rows can lead to poor load
>> balancing.
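>> A quick way to see that difference is to build the same logical entries both ways
>> and compare the sizes of the Key.compress'd ScanResults (sketch; the keys and the
>> batch size are made up):
>>
>> import java.util.ArrayList;
>> import java.util.List;
>> import org.apache.accumulo.core.data.Key;
>> import org.apache.accumulo.core.data.KeyValue;
>> import org.apache.accumulo.core.data.thrift.ScanResult;
>> import org.apache.thrift.TException;
>> import org.apache.thrift.TSerializer;
>> import org.apache.thrift.protocol.TCompactProtocol;
>>
>> public class LayoutComparison {
>>   static int wireSize(List<KeyValue> batch) throws TException {
>>     ScanResult sr = new ScanResult(Key.compress(batch), false);
>>     return new TSerializer(new TCompactProtocol.Factory()).serialize(sr).length;
>>   }
>>
>>   public static void main(String[] args) throws TException {
>>     List<KeyValue> manyColumns = new ArrayList<KeyValue>();
>>     List<KeyValue> manyRows = new ArrayList<KeyValue>();
>>     byte[] empty = new byte[0];
>>     for (int i = 0; i < 1000; i++) {
>>       String id = String.format("%08d", i);
>>       // one row, many columns: the row repeats, so Key.compress can elide it
>>       manyColumns.add(new KeyValue(new Key("key.value", id, ""), empty));
>>       // many rows: each row differs slightly, so nothing gets elided
>>       manyRows.add(new KeyValue(new Key("key.value" + id, "", ""), empty));
>>     }
>>     System.out.println("single row, many columns: " + wireSize(manyColumns) + " bytes");
>>     System.out.println("many adjacent rows:       " + wireSize(manyRows) + " bytes");
>>   }
>> }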
>> V/R
>> Jonathan
>> On Thu, Oct 22, 2015 at 11:54 AM, Josh Elser <>
>> wrote:
>>> Jonathan Wonders wrote:
>>>> I have been digging into some details of Accumulo to model the disk and
>>>> network costs associated with various types of scan patterns and I have
>>>> a few questions regarding compression.
>>>> Assuming an inverted index table with rows following the pattern of
>>>> <key><value><id>
>>>> and a scan that specifies an exact key and value so as to constrain the
>>>> range, it seems that the dominant factor in network utilization would
>>>> be sending key-value pairs from the tablet server to the client and a
>>>> secondary factor would be transmitting data from non-local RFiles
>>>> (assuming no caching).
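>>>> For concreteness, the scan I have in mind is just a row-prefix scan over the
>>>> index (sketch; the table name and the key/value concatenation are illustrative):
>>>>
>>>> import java.util.Map.Entry;
>>>> import org.apache.accumulo.core.client.Connector;
>>>> import org.apache.accumulo.core.client.Scanner;
>>>> import org.apache.accumulo.core.client.TableNotFoundException;
>>>> import org.apache.accumulo.core.data.Key;
>>>> import org.apache.accumulo.core.data.Range;
>>>> import org.apache.accumulo.core.data.Value;
>>>> import org.apache.accumulo.core.security.Authorizations;
>>>>
>>>> class InvertedIndexLookup {
>>>>   static void lookup(Connector conn, String key, String value) throws TableNotFoundException {
>>>>     Scanner scanner = conn.createScanner("invertedIndex", Authorizations.EMPTY);
>>>>     // every matching row starts with <key><value>; the rest of the row is the <id>
>>>>     scanner.setRange(Range.prefix(key + value));
>>>>     for (Entry<Key,Value> e : scanner) {
>>>>       String id = e.getKey().getRow().toString().substring(key.length() + value.length());
>>>>       // ... collect id ...
>>>>     }
>>>>   }
>>>> }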
>>> Sounds about right to me.
>>>> Is my understanding correct that the on-disk compression of this type of
>>>> table is predominantly a function of the average number of differing
>>>> bits between adjacent ids?  Or, has anyone observed a significant
>>>> improvement with gz or lzo vs no additional compression?  I'm
>>>> considering running some experiments to measure the difference for a few
>>>> types of ids (uuid, snowflake-like, content based hashes), but I'm
>>>> curious if anyone else has done similar experiments.
>>> There is definitely a noticeable difference between gzip, lzo/snappy
>>> (they're about on par in speed and size), and no compression. I'm not
>>> sure what the deltas are off the top of my head, but I would expect that
>>> you see noticeable differences.
>>> You can also notice differences in "densities" depending on the lexicographical
>>> delta between two keys. Sequential keys that differ very little will result
>>> in very dense RFiles (lots of keys), whereas keys that vary greatly will
>>> result in fewer keys per file. We've had some discussions about the
>>> run-length encoding in RFile lately -- have you stumbled on that?
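>>> If you do run those experiments, switching a table's block compression is just a
>>> table property change, e.g. (sketch; error handling and table name are up to you):
>>>
>>> import org.apache.accumulo.core.client.AccumuloException;
>>> import org.apache.accumulo.core.client.AccumuloSecurityException;
>>> import org.apache.accumulo.core.client.Connector;
>>> import org.apache.accumulo.core.client.TableNotFoundException;
>>>
>>> class CompressionExperiment {
>>>   // codec can be "gz", "snappy", "lzo", or "none"
>>>   static void useCodec(Connector conn, String table, String codec)
>>>       throws AccumuloException, AccumuloSecurityException, TableNotFoundException {
>>>     conn.tableOperations().setProperty(table, "table.file.compress.type", codec);
>>>     // only newly written RFiles pick up the codec, so force a rewrite to compare on disk
>>>     conn.tableOperations().compact(table, null, null, true, true);
>>>   }
>>> }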
>>>> Given a scan that specifies a range for an exact key and value, is there
>>>> any transport compression performed for tablet server to client
>>>> communication beyond the Key.compress method which appears to only
>>>> compress equivalent rows, columns, etc as opposed to those that share a
>>>> common prefix?
>>> No, there are no compression algorithms applied to the data before
>>> sending it. By default we use the TCompactProtocol from Thrift. If you're
>>> interested in the specifics, Thrift has some documentation on the subject.
>>> I do recall some experiments I tried previously in which a logical
>>> "object" in Accumulo was comprised of multiple key-value pairs which were
>>> returned in one key-value pair (very similar to how the WholeRowIterator
>>> serializes things). In this case, I experimented with compressing the
>>> serialized Key-Values before returning from the Iterator stack. Sadly, I
>>> don't recall what the takeaway was, but I don't believe it was outright
>>> "bad" :)
>>>> It seems possible to implement a more specialized compression algorithm
>>>> with the iterator framework, performing the decompression on the client
>>>> side, but I'm curious if it could lead to general scan performance
>>>> improvements if the default compression also involved run-length
>>>> encoding.
>>> This would be really neat to test.
>>> Some rigor in designing an experiment would be the first step, IMO. If
>>> you want to spear-head this, I'm sure there are many who would be happy to
>>> give some feedback. It would be prudent to figure out what the variables in
>>> the "equation" would be, specifically the distribution of Key-Value pairs
>>> -- amount stored, compression on disk, cache sizes, query workload. First figure
>>> out what you want to test, what you will tweak, and define a hypothesis you
>>> want to test.
>>> You could consider using the continuous ingest framework for
>>> data-generation and query workloads. You could also look at YCSB [1]
>>> for some inspiration. Either of these would make your life easier in not
>>> having to generate the read/write workloads.
>>> [1]
>>>> Any insight on this subject is much appreciated.
>>>> V/R
>>>> Jonathan
