accumulo-user mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: compression of keys for a sequential scan over an inverted index
Date Mon, 26 Oct 2015 15:06:53 GMT
Definitely. This is pretty cool. Thanks for sharing.

I'll have to try to make some time to poke around with this :)

William Slacum wrote:
> Thanks, Jonathan! I've wondered about specific numbers on this topic
> when dealing with geohashes, so this is a very useful tool.
>
> On Sun, Oct 25, 2015 at 11:22 AM, Jonathan Wonders <jwonders88@gmail.com
> <mailto:jwonders88@gmail.com>> wrote:
>
>     I have been able to put some more thought into this over the weekend
>     and make initial observations on tables I currently have populated.
>     Looking at the rfile-info for a few different tables, I noticed that
>     one which has particularly small lexicographical deltas between keys
>     costs an average of ~2.5 bits per key to store on disk.  All of the
>     data is stored in the row component of the key and a full row is
>     typically about 36 bytes.  I wrote a little utility to recreate
>     ScanResult objects for batches of sequential key-value pairs returned
>     from a scanner and then used the TCompactProtocol to write the
>     ScanResult to a byte array.  Each key-value pair costs roughly 48
>     bytes which makes sense given that every row is different and there
>     will be some space required for the timestamps, visibilities, and
>     other bookkeeping info.
>
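>     In case anyone wants to reproduce the measurement, the gist of the
>     utility is something like the rough sketch below.  It assumes the 1.x
>     client API (KeyValue, Key.compress, and the thrift ScanResult/TKeyValue
>     structs) and leaves out batching and error handling:
>
>     import java.util.ArrayList;
>     import java.util.List;
>     import java.util.Map.Entry;
>
>     import org.apache.accumulo.core.data.Key;
>     import org.apache.accumulo.core.data.KeyValue;
>     import org.apache.accumulo.core.data.Value;
>     import org.apache.accumulo.core.data.thrift.ScanResult;
>     import org.apache.accumulo.core.data.thrift.TKeyValue;
>     import org.apache.thrift.TSerializer;
>     import org.apache.thrift.protocol.TCompactProtocol;
>
>     public class TransportCost {
>       // Approximate the wire cost of one batch of scan results: apply the
>       // same relative-key encoding the tablet server uses (Key.compress),
>       // wrap the batch in a ScanResult, and measure its TCompactProtocol
>       // serialization.
>       public static double bytesPerEntry(Iterable<Entry<Key,Value>> batch)
>           throws Exception {
>         List<KeyValue> kvs = new ArrayList<KeyValue>();
>         for (Entry<Key,Value> e : batch) {
>           kvs.add(new KeyValue(e.getKey(), e.getValue().get()));
>         }
>         List<TKeyValue> tkvs = Key.compress(kvs);
>         ScanResult result = new ScanResult(tkvs, false);
>         byte[] encoded =
>             new TSerializer(new TCompactProtocol.Factory()).serialize(result);
>         return (double) encoded.length / kvs.size();
>       }
>     }
>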
>     Another table I looked at has larger lexicographical deltas between
>     keys and costs roughly 5 bytes per key to store on disk.  This table
>     is a reverse index with very large rows, each column within a row
>     identifies data that resides in another table.  Each column is
>     roughly 12 bytes uncompressed.  When encoded in a ScanResult, each
>     key-value pair costs roughly 25 bytes which makes sense since the
>     row cost should be negligible for large batch sizes and the overhead
>     from timestamp, visibility, and other bookkeeping info is roughly
>     the same as the other table.
>
>     Since compression depends heavily on both table design and the
>     actual data, it seemed the next logical step would be to create a
>     tool that the community could use to easily measure the compression
>     ratio for ScanResult objects.  So, I threw together a shell extension
>     to wrap the utility that I previously described.  It measures
>     the compression ratio for the default strategy (Key.compress) as well as a
>     few other simple strategies that seemed reasonable to test.  The
>     usage is almost the same as the scan command; it just prints out
>     compression statistics rather than the data.
>
>     It lives at https://github.com/jwonders/accumulo-experiments with
>     branches for Accumulo 1.6.x, 1.7.x, and 1.8.x.
>
>     Any feedback is welcome.  I hope others find this useful for
>     understanding this particular aspect of scan performance.
>
>     V/R
>     Jonathan
>
>
>     On Thu, Oct 22, 2015 at 4:37 PM, Jonathan Wonders
>     <jwonders88@gmail.com <mailto:jwonders88@gmail.com>> wrote:
>
>         Josh,
>
>         Thanks for the information.  I did read through the discussion
>         about compression of visibility expressions and columns within
>         RFiles a while back which got me thinking about some of this. It
>         makes sense that gzip or lzo/snappy compression would have a
>         very noticeable impact when there are columns or visibility
>         expressions that are not compressed with RLE even if neighboring
>         rows have very small lexicographical deltas.
>
>         I will put some thought into designing an experiment to evaluate
>         whether or not there is any benefit to applying RLE during
>         key-value transport from server to client.  Even if it proves to
>         be situationally beneficial, I think it could be implemented as
>         a common iterator similar to the WholeRowIterator.
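>
>         To make that concrete, what I have in mind is essentially front
>         coding of successive rows: send the length of the prefix shared
>         with the previous row plus only the remaining suffix.  A minimal
>         sketch of that delta step, independent of any Accumulo types
>         (the example rows are made up):
>
>         import java.nio.charset.StandardCharsets;
>
>         public class PrefixDelta {
>           // Length of the byte prefix the current row shares with the
>           // previous one; only the remaining suffix (plus a small length
>           // header) would need to cross the wire.
>           public static int sharedPrefixLength(byte[] prev, byte[] curr) {
>             int n = Math.min(prev.length, curr.length);
>             int i = 0;
>             while (i < n && prev[i] == curr[i]) {
>               i++;
>             }
>             return i;
>           }
>
>           public static void main(String[] args) {
>             byte[] prev = "key|value|0001".getBytes(StandardCharsets.UTF_8);
>             byte[] curr = "key|value|0002".getBytes(StandardCharsets.UTF_8);
>             // Shared prefix is 13 bytes, so only "2" is new for the second row.
>             System.out.println(sharedPrefixLength(prev, curr));
>           }
>         }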
>
>         Given the current compression strategy, I would expect better
>         server-client transport compression when retrieving a single row with
>         many columns
>
>         <key><value>        : <id>    []
>
>         compared to many lexicographically close rows.
>
>         <key><value><id>    :         []
>
>         with the understanding that very large rows can lead to poor
>         load balancing.
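>
>         As a made-up example of the difference (the column family and the
>         ids are placeholders), with the default Key.compress encoding in
>         mind:
>
>         import org.apache.accumulo.core.data.Key;
>         import org.apache.hadoop.io.Text;
>
>         public class LayoutComparison {
>           public static void main(String[] args) {
>             // Layout A: one row per (key, value), ids in the column qualifier.
>             Key a1 = new Key(new Text("key|value"), new Text("idx"), new Text("id-0001"));
>             Key a2 = new Key(new Text("key|value"), new Text("idx"), new Text("id-0002"));
>
>             // Layout B: the id is appended to the row, so every row is distinct.
>             Key b1 = new Key(new Text("key|value|id-0001"), new Text("idx"), new Text(""));
>             Key b2 = new Key(new Text("key|value|id-0002"), new Text("idx"), new Text(""));
>
>             // Key.compress only elides fields that are byte-for-byte equal to
>             // the previous key, so layout A can drop the repeated row while
>             // layout B resends nearly the full row for every id.
>             System.out.println("layout A rows equal: " + a1.getRow().equals(a2.getRow())); // true
>             System.out.println("layout B rows equal: " + b1.getRow().equals(b2.getRow())); // false
>           }
>         }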
>
>         V/R
>         Jonathan
>
>         On Thu, Oct 22, 2015 at 11:54 AM, Josh Elser
>         <josh.elser@gmail.com <mailto:josh.elser@gmail.com>> wrote:
>
>             Jonathan Wonders wrote:
>
>                 I have been digging into some details of Accumulo to
>                 model the disk and
>                 network costs associated with various types of scan
>                 patterns and I have
>                 a few questions regarding compression.
>
>                 Assuming an inverted index table with rows following the
>                 pattern of
>
>                 <key><value><id>
>
>                 and a scan that specifies an exact key and value so as
>                 to constrain the
>                 range, it seems that the dominant factor in network
>                 utilization would
>                 be sending key-value pairs from the tablet server to the
>                 client and a
>                 secondary factor would be transmitting data from
>                 non-local RFiles
>                 (assuming no caching).
>
>
>             Sounds about right to me.
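>
>             For reference, the scan you're describing looks something like
>             the sketch below (the table name and the '|' separator are just
>             placeholders):
>
>             import java.util.Map.Entry;
>
>             import org.apache.accumulo.core.client.Connector;
>             import org.apache.accumulo.core.client.Scanner;
>             import org.apache.accumulo.core.client.TableNotFoundException;
>             import org.apache.accumulo.core.data.Key;
>             import org.apache.accumulo.core.data.Range;
>             import org.apache.accumulo.core.data.Value;
>             import org.apache.accumulo.core.security.Authorizations;
>             import org.apache.hadoop.io.Text;
>
>             public class IndexScan {
>               // Scan every <key><value><id> row for one exact key/value pair.
>               // Each matching id comes back as its own key-value pair, which
>               // is where the server-to-client transport cost comes from.
>               public static void scan(Connector connector)
>                   throws TableNotFoundException {
>                 Scanner scanner =
>                     connector.createScanner("inverted_index", Authorizations.EMPTY);
>                 scanner.setRange(Range.prefix(new Text("somekey|somevalue|")));
>                 for (Entry<Key,Value> entry : scanner) {
>                   String row = entry.getKey().getRow().toString();
>                   String id = row.substring(row.lastIndexOf('|') + 1);
>                   System.out.println(id);
>                 }
>               }
>             }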
>
>                 Is my understanding correct that the on-disk compression
>                 of this type of
>                 table is predominantly a function of the average number
>                 of differing
>                 bits between adjacent ids?  Or, has anyone observed a
>                 significant
>                 improvement with gz or lzo vs no additional
>                 compression?  I'm
>                 considering running some experiments to measure the
>                 difference for a few
>                 types of ids (uuid, snowflake-like, content based
>                 hashes), but I'm
>                 curious if anyone else has done similar experiments.
>
>
>             There is definitely a noticeable difference between gzip,
>             lzo/snappy (they're about on par in speed and size), and
>             no compression. I'm not sure what the deltas are off the top
>             of my head, but I would expect that you see noticeable
>             differences.
>
>             You can also notice differences in "densities" depending on the
>             lexicographical delta between two keys. Sequential keys that
>             differ very little will result in very dense RFiles (lots of
>             keys), whereas keys that vary greatly will result in fewer keys
>             per file. We've had some discussions about the run-length
>             encoding in RFile lately -- have you stumbled on that?
>
>                 Given a scan that specifies a range for an exact key and
>                 value, is there
>                 any transport compression performed for tablet server to
>                 client
>                 communication beyond the Key.compress method which
>                 appears to only
>                 compress equivalent rows, columns, etc. as opposed to
>                 those that share a
>                 common prefix?
>
>
>             No, there are no compression algorithms applied to the data
>             before sending it. By default we use the TCompactProtocol
>             from Thrift. If you're interested in the specifics, Thrift
>             has some documentation on the subject.
>
>             I do recall some experiments I tried previously in which a
>             logical "object" in Accumulo was composed of multiple
>             key-value pairs which were returned in one key-value pair
>             (very similar to how the WholeRowIterator serializes
>             things). In this case, I experimented with compressing the
>             serialized Key-Values before returning from the Iterator
>             stack. Sadly, I don't recall what the takeaway was, but I
>             don't believe it was outright "bad" :)
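>
>             If you want to repeat that experiment, the gist was along these
>             lines (a sketch; encodeRow is the same static helper that
>             WholeRowIterator uses to pack a row into a single Value):
>
>             import java.io.ByteArrayOutputStream;
>             import java.io.IOException;
>             import java.util.List;
>             import java.util.zip.GZIPOutputStream;
>
>             import org.apache.accumulo.core.data.Key;
>             import org.apache.accumulo.core.data.Value;
>             import org.apache.accumulo.core.iterators.user.WholeRowIterator;
>
>             public class PackedRowSize {
>               // Pack one logical object's key-value pairs into a single Value
>               // (as WholeRowIterator does) and compare the raw vs. gzipped
>               // size of the payload the iterator stack would hand back.
>               public static void compare(List<Key> keys, List<Value> values)
>                   throws IOException {
>                 byte[] raw = WholeRowIterator.encodeRow(keys, values).get();
>                 ByteArrayOutputStream bos = new ByteArrayOutputStream();
>                 GZIPOutputStream gz = new GZIPOutputStream(bos);
>                 gz.write(raw);
>                 gz.close();
>                 System.out.println("raw bytes:     " + raw.length);
>                 System.out.println("gzipped bytes: " + bos.size());
>               }
>             }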
>
>                 It seems possible to implement a more specialized
>                 compression algorithm
>                 with the iterator framework, performing the
>                 decompression on the client
>                 side, but I'm curious if it could lead to general scan
>                 performance
>                 improvements if the default compression also involved
>                 run-length encoding.
>
>
>             This would be really neat to test.
>
>             Some rigor in designing an experiment would be the first
>             step, IMO. If you want to spear-head this, I'm sure there
>             are many who would be happy to give some feedback. It would
>             be prudent to figure out what the variables in the
>             "equation" would be, specifically the distribution of
>             Key-Value pairs -- amount stored, compression on disk, cache
>             sizes, query workload. First figure out what you want to test,
>             what you will tweak, and define a hypothesis you want to test.
>
>             You could consider using the continuous ingest framework for
>             data-generation and query workloads. You could also look at
>             YCSB [1]
>             for some inspiration. Either of these would make your life
>             easier in not having to generate the read/write workloads.
>
>
>             [1] https://github.com/brianfrankcooper/YCSB
>
>
>                 Any insight on this subject is much appreciated.
>
>                 V/R
>                 Jonathan
>
>
>
>
