cassandra-commits mailing list archives

From "Benedict (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-9060) Anticompaction hangs on bloom filter bitset serialization
Date Sat, 28 Mar 2015 13:03:52 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-9060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385290#comment-14385290 ]

Benedict commented on CASSANDRA-9060:
-------------------------------------

CASSANDRA-8670 will give the best option for this, but in the meantime I think this fix should
go into 2.1, since it is trivial and likely to have significant impact. It's kind of amazing
this oversight has gone unnoticed for so long, so thanks for pointing it out.


Looking at it, I'm not at all convinced by the wrapping/unwrapping of the longs either, since
our DataOutput implementations all just convert each writeLong() into a series of write(byte)
calls. But the simplest, least invasive solution is indeed to pass a BufferedOutputStream
into a DataOutputStreamPlus, rather than constructing a DataOutputStreamAndChannel. For 2.1
I think we should make this tiny change.
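
To illustrate the shape of that change in plain JDK types (a minimal sketch, not the actual
patch: DataOutputStream stands in for Cassandra's DataOutputStreamPlus, and the long[] bitset
and 64KB buffer size are assumptions):

{code:java}
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class BufferedFilterSerialization
{
    // Writing through a buffer means each writeLong() costs eight byte
    // copies into the buffer, rather than eight separate writes reaching
    // the underlying file/channel.
    static void serialize(long[] bitsetWords, String path) throws IOException
    {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(path), 64 * 1024)))
        {
            out.writeInt(bitsetWords.length);
            for (long word : bitsetWords)
                out.writeLong(word); // DataOutput writes big-endian, matching the on-disk format
        }
    }
}
{code}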

For 3.0, I think we should wait for CASSANDRA-8670, think through the long wrapping/unwrapping
business, and see if there is a clearer route. Perhaps bump the serialization version, so we
can simply stream the raw bytes to disk without any conversion; that makes the most sense, as
there's no reason to be flipping bytes here at all, since we always index into the data by byte.
If we want to maintain the serialization format, we could buffer segments of the filter into a
ByteBuffer/Memory object and apply Long.reverseBytes() prior to flushing that buffered data
to disk. On reading we could populate the entire bitset, then iterate through it reversing the
bytes as we go. I would prefer to see the on-disk representation match the in-memory one, though.
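
A minimal sketch of that format-preserving alternative, again in plain JDK types (the long[]
in-memory layout and 4KB buffer size are assumptions):

{code:java}
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;

public class ReversingFilterSerialization
{
    // Buffer whole words, swapping the byte order of each with
    // Long.reverseBytes(), then flush in bulk: the endianness conversion
    // happens per word, with no per-byte writes to the output.
    static void serialize(long[] bitsetWords, OutputStream out) throws IOException
    {
        ByteBuffer buffer = ByteBuffer.allocate(4096);
        for (long word : bitsetWords)
        {
            if (buffer.remaining() < 8)
            {
                out.write(buffer.array(), 0, buffer.position());
                buffer.clear();
            }
            buffer.putLong(Long.reverseBytes(word));
        }
        out.write(buffer.array(), 0, buffer.position());
    }
}
{code}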

As to the anticompaction calculation, that isn't my area, but your conclusion seems reasonable
to me. Taking a look at the code, it seems we would need to somehow correct for the ratio of
each sstable we expect to land on each side of the range, which might leave one side with a
worse-than-expected false positive ratio and the other with a better one. Whereas right now
both receive significantly better false positive ratios than the target. I'm not sure how we
could deal with this more effectively (at least without a bit more research effort and thought). [~krummas]?
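
To make that concrete with the standard bloom filter estimate (assumed textbook math, not the
code's exact sizing logic): a filter sized for an sstable's full key count but populated with
only half those keys after the split ends up well under the configured target:

{code:java}
public class AnticompactionFpSketch
{
    // Standard bloom filter false-positive estimate: p = (1 - e^(-k*n/m))^k
    // for n keys, m bits and k hash functions.
    static double fp(double keys, double bits, int hashes)
    {
        return Math.pow(1 - Math.exp(-hashes * keys / bits), hashes);
    }

    public static void main(String[] args)
    {
        double n = 1_000_000, m = 10 * n; // 10 bits per key, k = 7 (illustrative values)
        System.out.println(fp(n, m, 7));     // ~0.0082: the configured target
        System.out.println(fp(n / 2, m, 7)); // ~0.0002: what each side gets after a 50/50 split
    }
}
{code}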

> Anticompaction hangs on bloom filter bitset serialization 
> ----------------------------------------------------------
>
>                 Key: CASSANDRA-9060
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9060
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Gustav Munkby
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: trunk-9060.patch
>
>
> I tried running an incremental repair against a 15-node vnode cluster with roughly 500GB
> of data running on 2.1.3-SNAPSHOT, without performing the suggested migration steps. I
> manually chose a small range for the repair (using --start/end-token). The actual repair
> part took almost no time at all, but the anticompactions took a lot of time (not
> surprisingly).
> Obviously, this might not be the ideal way to run incremental repairs, but I wanted to
> look into what made the whole process so slow. The results were rather surprising: the
> majority of the time was spent serializing bloom filters.
> The reason seemed to be two-fold. First, the bloom filters generated were huge (probably
> because the original SSTables were large). With a proper migration to incremental repairs,
> I'm guessing this would not happen. Secondly, however, the bloom filters were being written
> to the output one byte at a time (with quite a few type conversions on the way) to transform
> the little-endian in-memory representation to the big-endian on-disk representation.
> I have implemented a solution where big-endian is used in-memory as well as on-disk,
> which obviously makes de-/serialization much, much faster. This introduces some slight
> overhead when checking the bloom filter, but I can't see how that would be problematic. An
> obvious alternative would be to still perform the serialization/deserialization using a
> byte array, but perform the byte-order swap there.



