cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Benedict (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-2698) Instrument repair to be able to assess it's efficiency (precision)
Date Wed, 31 Jul 2013 19:05:52 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-2698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725604#comment-13725604
] 

Benedict commented on CASSANDRA-2698:
-------------------------------------

Hi Yuki,

Had some fun rebasing, but think everything looks good now. A few things to note:

1) I'm not sure what you mean by not "serializing those" - for correctness I serialize all
of the data in a node. Do you want me to change the serialization methods to not send these
values? I don't log them the other end, but I would prefer they were sent to ensure no surprises
for users of the data, and also because of some optimisations to difference() that rely on
knowing the number of rows for each sub-tree. It's not a tremendous amount of data after all.

2) I've modified DifferencerTest, and created two versions of the testDifference() method
- one that tests differences on an empty tree, and one which tests a tree that has been populated
with rows. Previously only the former was tested. This is because the changes I made to difference()
for my previous patch, which I have retained and which ensures contiguous ranges are emitted
where possible, treats the entire empty tree as one contiguous difference range (since the
only non-empty sub-range in the tree is different), which was breaking the previous test.
This test now works with the fully populated tree, and the previous test now confirms that
the whole tree is considered different when it is empty. It's possible you may want to not
deploy these improvements in this patch, but it seems a good idea to me whilst it's being
modified, and given that I'd made the change already. Since we're not logging the ranges themselves
at this time it won't have any direct impact, but it will be useful if that ever changes,
and might help with future debugging.

3) I've updated the MerkleTreeTest methods to test the serialization and difference changes,
and introduced a new HistogramBuilderTest

4) The histogram is built differently from my first patch, and is described in HistogramBuilder.
Basically rather than creating neat linear ranges, I calculate the mean and create ranges
that are multiples of the standard deviation either side of the mean, up to min/max (or, in
this case, 3 stdevs, plus one range to min/max)

5) One thing we might want to consider changing is the format of the EstimatedHistogram ranges
in the log messages. I've reproduced faithfully the boundary conventions of the EstimatedHistogram,
but this is not a user friendly convention - it has an exclusive lower bound and inclusive
upper bound, as opposed to the typical opposite convention. As such we get ranges like (-1,
0] to represent the range containing only 0, as opposed to [0, 1)

Think that's everything. Should respond quickly to queries at the moment, so drop me a line
if you have any questions.
                
> Instrument repair to be able to assess it's efficiency (precision)
> ------------------------------------------------------------------
>
>                 Key: CASSANDRA-2698
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2698
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Sylvain Lebresne
>            Assignee: Benedict
>            Priority: Minor
>              Labels: lhf
>         Attachments: nodetool_repair_and_cfhistogram.tar.gz, patch_2698_v1.txt, patch.diff,
patch-rebased.diff, patch.taketwo.alpha.diff
>
>
> Some reports indicate that repair sometime transfer huge amounts of data. One hypothesis
is that the merkle tree precision may deteriorate too much at some data size. To check this
hypothesis, it would be reasonably to gather statistic during the merkle tree building of
how many rows each merkle tree range account for (and the size that this represent). It is
probably an interesting statistic to have anyway.   

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message