cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Nadav Har'El (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-10728) Hash used in repair does not include partition key
Date Wed, 18 Nov 2015 14:53:11 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15011118#comment-15011118
] 

Nadav Har'El commented on CASSANDRA-10728:
------------------------------------------

What if in one replica we have partition 'a' with value 1 and some timestamp, and in a second
replica, we have partition 'b' with a value 1 and the same timestamp - and 'a' and 'b' happen
to be close enough in their tokens to be in the same Merkle Tree partition range?

I realize this is very unlikely case (especially considering the need for the timestamps to
be identical, which I haven't considered before). But it seems it's possible... For example,
consider a contrieved use case which uses Cassandra to store a large set of keys - the value
of each key is always set to "1" (or whatever). Now, at exactly the same time (at millisecond
resolution, which is Cassandra's default), two servers want to write two different keys "a"
and "b" - and because of a partition in the cluster, "a" ends up on one machine, "b" on a
second machine - and both have the same time (in miilisecond resolution, it's not completely
improbable).

> Hash used in repair does not include partition key
> --------------------------------------------------
>
>                 Key: CASSANDRA-10728
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10728
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Nadav Har'El
>            Priority: Minor
>
> When the repair code builds the Merkle Tree, it appears to be using AbstractCompactedRow.update()
to calculate a partition's hash. This method's documentation states that it calculates a "digest
with the data bytes of the row (not including row key or row size).". The code itself seems
to agree with this comment.
> However, I believe that not including the row (actually, partition) key in the hash function
is a mistake: This means that if two nodes have the same data but different key, repair would
not notice this discrepancy. Moreover, if two different keys have their data switched - or
have the same data - again this would not be noticed by repair. Actually running across this
problem in a real repair is not very likely, but I can imagine seeing it easily in an hypothetical
use case where all partitions have exactly the same data and just the partition key matters.
> I am sorry if I'm mistaken and the partition key is actually taken into account in the
Merkle tree, but I tried to find evidence that it does and failed. Glancing over the code,
it almost seems that it does use the key: Validator.add() calculates rowHash() which includes
the digest (without the partition key) *and* the key's token. But then, the code calls MerkleTree.TreeRange.addHash()
on that tuple, and that function conspicuously ignores the token, and only uses the digest.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message