couchdb-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Anderson <jch...@apache.org>
Subject Re: Reducible checksum?
Date Thu, 10 Dec 2009 17:34:18 GMT
On Wed, Dec 9, 2009 at 1:31 AM, Brian Candler <B.Candler@pobox.com> wrote:
> On Mon, Dec 07, 2009 at 02:27:18PM -0800, Chris Anderson wrote:
>> If there's a generic way to do this, and it
>> is cheap enough, it could be generalized to handle view etags. Your
>> row count + max timestamp trick seems sensible to me, but obviously is
>> not generalizable.
>>
>> Presumably you could avoid hashing the keys and values by leaning on
>> the document._rev. However, that just pushes the problem back a step.
>
> Aha. These days the _rev includes a cryptographic checksum of the document
> contents, correct?
>
> In that case I think all we need is a simple sum, modulo 2^128, of the _rev
> hashes. This is commutative and associative, and very cheap. (An XOR has the
> problem that if the same document is emitted an even number of times, it
> vanishes).
>
> Now, we know that the set of (k,v) pair(s) emitted for any particular doc
> can't change unless you modify the design doc. So we could simply add this
> sum to a hash of the design doc as a final step, to get the etag. You'd also
> have to include the view params in the final hash (e.g. startkey, endkey,
> skip) because it's possible that the same set of docs could appear under
> different parts of the key space, emitting different keys and/or values, and
> you'd want different etags for those.
>
> This leaves one question: I know the view index already contains
> {{key,_id},value}, but does it include _rev? If not, would the extra storage
> overhead be acceptable?

There have been request for _rev in the view rows in the past but
we've always responded with the recommendation to just emit the rev as
part of your value.

If we were to use a solution like this for the view etags we'd want to
hard-code it, but for now you should be able to prototype your
approach with the current API.

>
> The alternative I can see is to take a hash of each {{key,_id},value} and to
> sum those hashes. The trouble is this is more expensive computation-wise,
> especially if the value is large (e.g. when you emit the whole document).
>
> Cheers,
>
> Brian.
>



-- 
Chris Anderson
http://jchrisa.net
http://couch.io

Mime
View raw message