cassandra-commits mailing list archives

From "Sylvain Lebresne (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-47) SSTable compression
Date Tue, 12 Jul 2011 20:05:01 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13064101#comment-13064101 ]

Sylvain Lebresne commented on CASSANDRA-47:
-------------------------------------------

bq. As I wrote before - currently to check real size of the file (tested only on linux because
OS X FS saves empty blocks to the disk for some reason) you need to get a block count using
'ls -alhs', current patch reserves an empty space for each chunk because we need to do seeks
while we write data using SSTableWriter.

Yeah, I really think we shouldn't do that (i.e., have empty space between the compressed chunks).
I'm happy to learn that linux (or at least whatever file system you are using, I haven't tried
the patch on linux yet) is smart enough to avoid allocating empty blocks, but we shouldn't
rely on this. I bet not all file systems do that (osx seems to prove it, and I'm not sure
all linux file systems do either) and anyway, if you transfer the sstables or tar them or
anything, it'll still be less efficient than necessary (because the file still *is* the size
of the uncompressed data). We're also losing some space even on linux, depending on what the
actual FS block size is (not a big deal, but this can add up). So I think we really need to
change the index (and key cache) to store the offset in the compressed data. Imho, the simplest
way would be, for a compressed file, to store in the index the key, then the position of the
chunk in the compressed file, then the offset within the uncompressed chunk, instead of just
the key followed by the offset.
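
To make the proposed index layout concrete, here is a rough sketch in Python (not the patch's
code, just an illustration). It assumes a fixed uncompressed chunk length and a 4-byte length
prefix before each compressed chunk, and it uses zlib as a stand-in for Snappy; CHUNK_LENGTH,
write_compressed, index_entry and read_at are all hypothetical names. Chunks are written back
to back, so the file is only as large as the compressed data.

```python
import io
import struct
import zlib

# Assumed uncompressed chunk size; purely illustrative.
CHUNK_LENGTH = 64 * 1024


def write_compressed(data, out):
    """Write data as back-to-back compressed chunks (no reserved empty space).

    Returns the start position of each chunk in the compressed file.
    """
    chunk_positions = []
    for start in range(0, len(data), CHUNK_LENGTH):
        compressed = zlib.compress(data[start:start + CHUNK_LENGTH])
        chunk_positions.append(out.tell())
        # Hypothetical framing: 4-byte big-endian length, then the chunk bytes.
        out.write(struct.pack(">I", len(compressed)))
        out.write(compressed)
    return chunk_positions


def index_entry(key, uncompressed_position, chunk_positions):
    """Build (key, chunk position in compressed file, offset in uncompressed chunk)."""
    chunk_number, offset_in_chunk = divmod(uncompressed_position, CHUNK_LENGTH)
    return (key, chunk_positions[chunk_number], offset_in_chunk)


def read_at(f, chunk_position, offset_in_chunk, length):
    """Seek to a chunk, decompress it, and return bytes starting at the offset.

    For brevity this does not follow a read across a chunk boundary.
    """
    f.seek(chunk_position)
    (compressed_length,) = struct.unpack(">I", f.read(4))
    chunk = zlib.decompress(f.read(compressed_length))
    return chunk[offset_in_chunk:offset_in_chunk + length]


if __name__ == "__main__":
    data = bytes(range(256)) * 1024          # 256 KiB of sample data
    out = io.BytesIO()
    positions = write_compressed(data, out)
    entry = index_entry(b"some-key", 70000, positions)
    print(entry, read_at(out, entry[1], entry[2], 10))
```

With entries like this, a lookup needs one seek to the chunk position and one decompression,
and nothing depends on the file system handling sparse files well.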

Another thing is that we will need compression to be optional (if only because we cannot expect
people to trust this from day one). Don't get me wrong, it's nice to have a first prototype to
get an idea of what we're talking about, but I just wanted to mention this because it's probably
easier to take that into account sooner rather than later (I also suspect we may be able to
factor out some of the code of BRAF and CDF, but I haven't looked too closely, so maybe not).



> SSTable compression
> -------------------
>
>                 Key: CASSANDRA-47
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-47
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Pavel Yaskevich
>              Labels: compression
>             Fix For: 1.0
>
>         Attachments: CASSANDRA-47.patch, snappy-java-1.0.3-rc4.jar
>
>
> We should be able to do SSTable compression which would trade CPU for I/O (almost always
> a good trade).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
