cassandra-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stephen Connolly <stephen.alan.conno...@gmail.com>
Subject Re: Simple Compression Idea
Date Mon, 31 Jan 2011 08:19:13 GMT
On 31 January 2011 04:41, David G. Boney <dboney1@semanticartifacts.com> wrote:
> I propose a simple idea for compression using a compressed string datatype.
>
> The compressed string datatype could be implemented for column family keys by creating
a compressed string ordered partitioner. The compressed string ordered partitioner works by
decompressing the string and then applying an ordered partitioner for strings to the decompressed
string. The hash based partitioner would be used with the compressed string without any modification.
The compressed string datatype could be implemented for column keys by creating a compressed
string comparator. A compressed string comparator would work by decompressing the string and
then applying a string comparator. The compressed string datatype could be implemented for
column values. The compressed string datatype would be an internal datatype for Cassandra.
The compressed string would be converted to a string before returning the value to a client.
I suppose you could also have an option of passing the compressed form back to the client
if the client wanted it that way.
>
> I propose using an adaptive arithmetic coding compressor. This type of compression can
be done a byte at a time. It will build a compression model only on the string that is presented,
a byte at a time. See the below papers.
>
> Moffat, Alistair, Radford M. Neal, & Ian H. Witten, (1998), "Arithmetic Coding Revisited",
ACM Trans. on Info. Systems, Vol. 16, No. 3, pp. 256-294.
>
> Witten, Ian H., Radford M. Neal, & John G. Cleary, (1987), "Arithmetic Coding for
Data Compression", Communications of the ACM, Vol. 30, No. 6, pp. 520-540.
>
> It has been reported that arithmetic coding based compression applied to text can get
compression ratios of up to 2.2 bits per character. Assuming you only get 4 bits per character
because of short strings. This would be a 50% compression of text data, including keys and
column names. Many applications would benefit. It should speed up the overall operation of
Cassandra because you would be moving significantly less data through the system.

I have not read those papers but how do they and their reported
compressions apply to the unicode characters that java strings are
stored as?

-Stephen

>
> This would provide a compression option that could be implemented without any redesign
to the internal structure of Cassandra except for the a new partitioner class, a new comparator
class, a new datatype class, and the compression class.
> -------------
> Sincerely,
> David G. Boney
> dboney1@semanticartifacts.com
> http://www.semanticartifacts.com
>
>
>
>
>

Mime
View raw message