accumulo-user mailing list archives

From Jonathan Wonders <jwonder...@gmail.com>
Subject Re: compression of keys for a sequential scan over an inverted index
Date Sun, 25 Oct 2015 18:22:05 GMT
I have been able to put some more thought into this over the weekend and
make some initial observations on tables I currently have populated.  Looking at
the rfile-info output for a few different tables, I noticed that one which has
particularly small lexicographical deltas between keys costs an average of
~2.5 bits per key to store on disk.  All of the data is stored in the row
component of the key and a full row is typically about 36 bytes.  I wrote a
little utility to recreate ScanResult objects for batches of sequential
key-value pairs returned from a scanner and then used the TCompactProtocol to
write the ScanResult to a byte array.  Each key-value pair costs roughly 48
bytes, which makes sense given that every row is different and there will be
some space required for the timestamps, visibilities, and other bookkeeping
info.
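
For anyone who wants to reproduce the measurement without the shell
extension, the core of the utility is essentially the sketch below.  It
assumes the 1.x locations of the thrift-generated ScanResult and TKeyValue
classes (org.apache.accumulo.core.data.thrift); the class and method names
are just for illustration.

import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.KeyValue;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.data.thrift.ScanResult;
import org.apache.accumulo.core.data.thrift.TKeyValue;
import org.apache.thrift.TException;
import org.apache.thrift.TSerializer;
import org.apache.thrift.protocol.TCompactProtocol;

public class ScanResultSizer {

  // Sketch only: serialize one batch of scanner entries the way the tablet
  // server would (Key.compress + TCompactProtocol) and return the byte count.
  // Package names assume the 1.x thrift-generated classes.
  static long serializedSize(List<Entry<Key,Value>> batch) throws TException {
    List<KeyValue> kvs = new ArrayList<>(batch.size());
    for (Entry<Key,Value> e : batch) {
      kvs.add(new KeyValue(e.getKey(), e.getValue().get()));
    }
    // Key.compress is the default relative-key encoding used in the RPC payload.
    List<TKeyValue> tkvs = Key.compress(kvs);
    ScanResult result = new ScanResult(tkvs, /* more */ false);
    TSerializer serializer = new TSerializer(new TCompactProtocol.Factory());
    return serializer.serialize(result).length;
  }
}

Dividing that byte count by the batch size gives the per-entry cost quoted
above.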

Another table I looked at has larger lexicographical deltas between keys
and costs roughly 5 bytes per key to store on disk.  This table is a
reverse index with very large rows; each column within a row identifies
data that resides in another table.  Each column is roughly 12 bytes
uncompressed.  When encoded in a ScanResult, each key-value pair costs
roughly 25 bytes, which makes sense since the row cost should be negligible
for large batch sizes and the overhead from timestamps, visibilities, and
other bookkeeping info is roughly the same as for the other table.

Since compression depends heavily on both table design and the actual data,
the next logical step seemed to be a tool that the community could use to
easily measure the compression ratio for ScanResult objects.  So, I threw
together a shell extension to wrap the utility described above.  It measures
the compression ratio for the default strategy, Key.compress, as well as a
few other simple strategies that seemed reasonable to test.  The usage is
almost the same as the scan command; it just prints compression statistics
rather than the data.

It lives at https://github.com/jwonders/accumulo-experiments with branches
for Accumulo 1.6.x, 1.7.x, and 1.8.x.
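
As a rough point of comparison (just an illustration, not necessarily one of
the strategies the extension implements), general-purpose compression of the
serialized batch can be measured with nothing more than the JDK:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class GzipRatio {

  // Hypothetical helper: serialized-bytes / gzipped-bytes for one
  // serialized ScanResult batch.
  static double gzipRatio(byte[] serializedScanResult) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
      gz.write(serializedScanResult);
    }
    return (double) serializedScanResult.length / bos.size();
  }
}

A ratio close to 1.0 would suggest that TCompactProtocol plus Key.compress
already captures most of the available redundancy for that table.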

Any feedback is welcome.  I hope others find this useful for understanding
this particular aspect of scan performance.

V/R
Jonathan


On Thu, Oct 22, 2015 at 4:37 PM, Jonathan Wonders <jwonders88@gmail.com>
wrote:

> Josh,
>
> Thanks for the information.  I did read through the discussion about
> compression of visibility expressions and columns within RFiles a while
> back which got me thinking about some of this. It makes sense that gzip or
> lzo/snappy compression would have a very noticeable impact when there are
> columns or visibility expressions that are not compressed with RLE even if
> neighboring rows have very small lexicographical deltas.
>
> I will put some thought into designing an experiment to evaluate whether or
> not there is any benefit to applying RLE during key-value transport from
> server to client.  Even if it proves to be situationally beneficial, I
> think it could be implemented as a common iterator similar to the
> WholeRowIterator.
>
> Given the current compression strategy, I would expect better
> server-client transport compression retrieving a single row with many
> columns
>
> <key><value>        : <id>    []
>
> compared to many lexicographically close rows.
>
> <key><value><id>    :         []
>
> with the understanding that very large rows can lead to poor load
> balancing.
>
> V/R
> Jonathan
>
> On Thu, Oct 22, 2015 at 11:54 AM, Josh Elser <josh.elser@gmail.com> wrote:
>
>> Jonathan Wonders wrote:
>>
>>> I have been digging into some details of Accumulo to model the disk and
>>> network costs associated with various types of scan patterns and I have
>>> a few questions regarding compression.
>>>
>>> Assuming an inverted index table with rows following the pattern of
>>>
>>> <key><value><id>
>>>
>>> and a scan that specifies an exact key and value so as to constrain the
>>> range, it seems that the dominant factor in network utilization would
>>> be sending key-value pairs from the tablet server to the client and a
>>> secondary factor would be transmitting data from non-local RFiles
>>> (assuming no caching).
>>>
>>
>> Sounds about right to me.
>>
>>> Is my understanding correct that the on-disk compression of this type of
>>> table is predominantly a function of the average number of differing
>>> bits between adjacent ids?  Or, has anyone observed a significant
>>> improvement with gz or lzo vs no additional compression?  I'm
>>> considering running some experiments to measure the difference for a few
>>> types of ids (uuid, snowflake-like, content based hashes), but I'm
>>> curious if anyone else has done similar experiments.
>>>
>>
>> There is definitely a noticed difference between gzip, lzo/snappy
>> (they're about on par with speed and size), and no compression. I'm not
>> sure what the deltas are off the top of my head, but I would expect that
>> you see noticeable differences.
>>
>> You can also notice differences in "densities" depending on the
>> lexicographical delta between two keys. Sequential keys that differ very
>> little will result in very dense RFiles (lots of keys), whereas keys that
>> vary greatly will result in fewer keys per file. We've had some discussions about the
>> run-length encoding in RFile lately -- have you stumbled on that?
>>
>>> Given a scan that specifies a range for an exact key and value, is there
>>> any transport compression performed for tablet server to client
>>> communication beyond the Key.compress method which appears to only
>>> compress equivalent rows, columns, etc as opposed to those that share a
>>> common prefix?
>>>
>>
>> No, there are no compression algorithms applied to the data before
>> sending it. By default we use the TCompactProtocol from Thrift. If you're
>> interested in the specifics, Thrift has some documentation on the subject.
>>
>> I do recall some experiments I tried previously in which a logical
>> "object" in Accumulo was comprised of multiple key-value pairs which were
>> returned in one key-value pair (very similar to how the WholeRowIterator
>> serializes things). In this case, I experimented with compressing the
>> serialized Key-Values before returning from the Iterator stack. Sadly, I
>> don't recall what the takeaway was, but I don't believe it was outright
>> "bad" :)
>>
>>> It seems possible to implement a more specialized compression algorithm
>>> with the iterator framework, performing the decompression on the client
>>> side, but I'm curious if it could lead to general scan performance
>>> improvements if the default compression also involved run-length
>>> encoding.
>>>
>>
>> This would be really neat to test.
>>
>> Some rigor in designing an experiment would be the first step, IMO. If
>> you want to spear-head this, I'm sure there are many who would be happy to
>> give some feedback. It would be prudent to figure out what the variables in
>> the "equation" would be, specifically the distribution of Key-Value pairs
>> -- amount stored, compression on disk, cache sizes, query workload. Figure
>> out what you want to test, what you will tweak, and define a hypothesis you
>> want to test.
>>
>> You could consider using the continuous ingest framework for
>> data-generation and query workloads. You could also look at YCSB [1]
>> for some inspiration. Either of these would make your life easier in not
>> having to generate the read/write workloads.
>>
>>
>> [1] https://github.com/brianfrankcooper/YCSB
>>
>>
>>> Any insight on this subject is much appreciated.
>>>
>>> V/R
>>> Jonathan
>>>
>>
>
