cassandra-commits mailing list archives

From "Pavel Yaskevich (JIRA)" <j...@apache.org>
Subject [jira] [Issue Comment Edited] (CASSANDRA-47) SSTable compression
Date Fri, 08 Jul 2011 22:24:18 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13062204#comment-13062204 ]

Pavel Yaskevich edited comment on CASSANDRA-47 at 7/8/11 10:24 PM:
-------------------------------------------------------------------

Patch introduces CompressedDataFile with Input/Output classes. Snappy is used for compression/decompression
because it was faster than ning in tests. Files are split into 4 bytes +
64kb chunks, where the 4 bytes hold the compressed chunk size. Note that the current SSTable
file format is preserved and no modifications were made to the index, statistics or filter components.
Both Input and Output classes extend RandomAccessFile, so random I/O works as expected.
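
To illustrate the "4 bytes + chunk" layout, here is a minimal sketch, not the code from the patch. The class and method names are hypothetical, the JDK's Deflater/Inflater stand in for Snappy so the example runs without extra jars, and it assumes 64kb is the uncompressed size of each chunk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical illustration of length-prefixed compressed chunks; not CompressedDataFile itself. */
public class ChunkFramingSketch {
    // Assumption: 64kb is the uncompressed size of each chunk.
    static final int CHUNK_SIZE = 64 * 1024;

    /** Writes [4-byte compressed length][compressed bytes] for every CHUNK_SIZE slice of data. */
    static byte[] writeChunks(byte[] data) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (int offset = 0; offset < data.length; offset += CHUNK_SIZE) {
            int length = Math.min(CHUNK_SIZE, data.length - offset);
            byte[] compressed = compress(data, offset, length);
            out.writeInt(compressed.length); // the 4 bytes holding the compressed chunk size
            out.write(compressed);
        }
        out.flush();
        return bytes.toByteArray();
    }

    /** Reads the chunks back in order and returns the original bytes. */
    static byte[] readChunks(byte[] framed) throws IOException, DataFormatException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(framed));
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        while (in.available() > 0) {
            byte[] compressed = new byte[in.readInt()];
            in.readFully(compressed);
            result.write(decompress(compressed));
        }
        return result.toByteArray();
    }

    // Stand-in codec: Deflater instead of Snappy (assumption for self-containment).
    static byte[] compress(byte[] data, int offset, int length) {
        Deflater deflater = new Deflater();
        deflater.setInput(data, offset, length);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!deflater.finished()) {
            out.write(buffer, 0, deflater.deflate(buffer));
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] decompress(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!inflater.finished()) {
            int n = inflater.inflate(buffer);
            if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                throw new DataFormatException("truncated or unsupported chunk");
            }
            out.write(buffer, 0, n);
        }
        inflater.end();
        return out.toByteArray();
    }
}
```

Because every chunk carries its own size prefix, a reader can skip from chunk to chunk without decompressing, which is one plausible way to keep random I/O workable.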

All SSTable files are opened using CompressedDataFile.Input. On startup, when SSTableReader.open
gets called, it first checks whether the data file is already compressed and compresses it if
not, so users won't have a problem after they update.

The header of the file reserves 8 bytes for a "real data size", so SSTables themselves and the
other components of the system that use them have no idea that the data file is compressed.
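
A minimal sketch of the reserved 8-byte header, assuming it is a single long at offset 0 that is filled in once the uncompressed size is known. The class and method names are hypothetical and this is not the patch's implementation.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

/** Hypothetical illustration of an 8-byte "real data size" header; not the patch's code. */
public class RealSizeHeaderSketch {
    static final int HEADER_SIZE = 8; // one long

    /** Reserves the header, writes the stored bytes, then patches in the real size. */
    static void write(File file, byte[] storedBytes, long realDataSize) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
            raf.setLength(0);
            raf.writeLong(0L);           // reserve 8 bytes at the start of the file
            raf.write(storedBytes);      // stand-in for the compressed chunks
            raf.seek(0);
            raf.writeLong(realDataSize); // assumption: header is patched after the data is written
        }
    }

    /** Returns the uncompressed length that callers would see as the file length. */
    static long readRealDataSize(File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            return raf.readLong();
        }
    }
}
```

A wrapper that reports this value as its length lets callers work in uncompressed offsets while the file on disk stays compressed.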

Streaming of a data file sends decompressed chunks, which keeps the transfer simple to maintain, and the
receiving party compresses all data before writing it to the backing file (see CompressedDataFile.transfer(...)
and the CompressedFileReceiver class).

Tests show a dramatic performance increase when reading 1 million rows created with 1024-byte
random values. Current code takes >> 1000 secs to read, but with the current patch only
175 secs. Using a 64kb buffer, a 1.7GB file could be compressed into 110MB (data added using ./bin/stress
-n 1000000 -S 1024 -V, where the -V option generates average size values and different cardinality
from 50 (default) to 250).

Writes perform a bit better, around 5-10%.

      was (Author: xedin):
    Patch introduces CompressedDataFile with Input/Output classes. Snappy is used for compression/decompression
because it was faster than ning in tests. Files are split into 4 bytes +
64kb chunks, where the 4 bytes hold the compressed chunk size. Note that the current SSTable
file format is preserved and no modifications were made to the index, statistics or filter components.
Both Input and Output classes extend RandomAccessFile, so random I/O works as expected.

All SSTable files are opened using CompressedDataFile.Input. On startup, when SSTableReader.open
gets called, it first checks whether the data file is already compressed and compresses it if
not, so users won't have a problem after they update.

The header of the file reserves 8 bytes for a "real data size", so SSTables themselves and the
other components of the system that use them have no idea that the data file is compressed.

Streaming of a data file sends decompressed chunks, which keeps the transfer simple to maintain, and the
receiving party compresses all data before writing it to the backing file (see CompressedDataFile.transfer(...)
and the CompressedFileReceiver class).

Tests show a dramatic performance increase when reading 1 million rows created with 1024-byte
random values. Current code takes >> 1000 secs to read, but with the current patch only
175 secs. Using a 64kb buffer, a 1.7GB file could be compressed into 110MB (data added using ./bin/stress
-n 1000000 -S 1024 -V, where the -V option generates random values).

Writes perform a bit better, around 5-10%.
  
> SSTable compression
> -------------------
>
>                 Key: CASSANDRA-47
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-47
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Pavel Yaskevich
>              Labels: compression
>             Fix For: 1.0
>
>         Attachments: CASSANDRA-47.patch, snappy-java-1.0.3-rc4.jar
>
>
> We should be able to do SSTable compression which would trade CPU for I/O (almost always
> a good trade).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
