couchdb-user mailing list archives

From Sho Fukamachi <sho.fukama...@gmail.com>
Subject Re: when to use another document and when not to?
Date Tue, 05 Aug 2008 05:15:27 GMT

On 28/07/2008, at 3:18 AM, Paul Carey wrote:

> At the risk of misunderstanding / stating the obvious, there's nothing
> to stop you doing both 2 and 3 and eliminating the n+1 query loop.
> Each user can maintain an array of both those he follows and those who
> follow him. You could update both follower and leader in a bulk
> transaction to ensure consistency.

At this stage I don't think there's anything obvious to state! Yes -  
this could be option 4 on that list, storing the relationship on  
*both* sides, which does indeed solve the basic problem of how to get  
one from the other minus the n+1 query.

I had also considered this solution but I think there are a few  
problems with this "multi-master" style of recording this information.

Firstly, as far as I can see it has replication problems.

  For example, say you've got two servers, one in Japan and one in the  
USA. A user in the US adds a tag to a photo, both Photo and Tag  
records are updated. Meanwhile the owner of the photo in Japan  
successfully deletes an inappropriate tag from the same photo. What  
happens when the two servers then try to reconcile these records a  
couple of minutes later? The tag records will be fine, the photo  
record will be in conflict with two new versions of the doc - one has  
a tag ID deleted, the other has one added. The servers have no way of  
knowing what to do without going back and trying to rebuild from the  
tags, which will require intelligence on the program side - in a well-designed
system this condition should never be allowed to arise.
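
To make that scenario concrete, here is a rough sketch in plain Python. It is
only an in-memory model under my own assumptions - the field name tag_ids, the
rev strings and the helper conflicting_revisions are all made up for
illustration, and none of it is CouchDB's actual API or conflict algorithm.

```python
# Hypothetical in-memory model of the scenario; none of this is CouchDB's API.
import copy

# The revision both servers hold before the edits.
base = {"_id": "photo-1", "_rev": "1-base", "tag_ids": ["sunset", "spam"]}

usa = copy.deepcopy(base)
japan = copy.deepcopy(base)

# USA: a user adds a tag, so the Photo doc gets a new revision there.
usa["tag_ids"].append("beach")
usa["_rev"] = "2-usa"

# Japan: the owner deletes the inappropriate tag at about the same time.
japan["tag_ids"].remove("spam")
japan["_rev"] = "2-japan"


def conflicting_revisions(parent, a, b):
    """Return both docs if each replica changed the same parent revision."""
    both_changed = a["_rev"] != parent["_rev"] and b["_rev"] != parent["_rev"]
    return [a, b] if both_changed and a["_rev"] != b["_rev"] else []


conflicts = conflicting_revisions(base, usa, japan)
wanted = ["sunset", "beach"]  # what a human would expect after both edits

for doc in conflicts:
    # Neither revision on its own matches the list a human would expect.
    print(doc["_rev"], doc["tag_ids"], doc["tag_ids"] == wanted)
```

Neither of the two revisions holds the list you actually want, so picking a
winner loses one of the edits and something has to merge them by hand.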

Second, it doesn't solve the "contention" problem for popular records.  
In fact it magnifies it since now the updates are happening on both  
sides. This makes the above replication issue worse. In a photos  
situation, imagine a few more things you're trying to store in the  
Photo document - users who have made it a favourite, for example. Now  
imagine the photo gets linked on Digg and a few thousand people try to  
make it a favourite and tag it in the space of a few minutes. Your  
users are likely "operating" on a copy of the actual record - they  
have in effect "checked it out" and are making their changes (adding  
tags, favourites, etc). Obviously this is going to be very bad, as when
they check it back in, any changes made in the meantime are lost - unless
you start adding logic to deal with all of that. But you can see it
would get complex and doesn't really play to couch's strengths.
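
A small sketch of the "check out / check in" problem, again in plain Python
with a dict standing in for the stored Photo doc. The function names and the
integer _rev are assumptions for illustration only. It shows a blind write
silently dropping someone else's change, and a revision-checked write
rejecting the stale copy instead (which then needs the retry logic I mean).

```python
# Hypothetical sketch: a dict stands in for the stored Photo document.
import copy

store = {"photo-1": {"_rev": 1, "tag_ids": [], "favourited_by": []}}


def check_out(doc_id):
    """Hand the caller a private copy of the current document."""
    return copy.deepcopy(store[doc_id])


def check_in_blind(doc_id, doc):
    """No revision check: whoever writes last wins."""
    doc["_rev"] = store[doc_id]["_rev"] + 1
    store[doc_id] = doc


def check_in_checked(doc_id, doc):
    """Reject the write if the copy is based on an older revision."""
    if doc["_rev"] != store[doc_id]["_rev"]:
        return False
    doc["_rev"] += 1
    store[doc_id] = doc
    return True


# Two users work on copies of the same revision.
alice, bob = check_out("photo-1"), check_out("photo-1")
alice["favourited_by"].append("alice")
bob["tag_ids"].append("digg")
check_in_blind("photo-1", alice)
check_in_blind("photo-1", bob)  # overwrites alice's favourite
after_blind = copy.deepcopy(store["photo-1"])

# Same race with a revision check: the second writer is refused.
carol, dave = check_out("photo-1"), check_out("photo-1")
carol["favourited_by"].append("carol")
dave["tag_ids"].append("frontpage")
carol_ok = check_in_checked("photo-1", carol)
dave_ok = check_in_checked("photo-1", dave)

print(after_blind["favourited_by"], carol_ok, dave_ok)
```

With a few thousand writers on one doc, nearly everyone ends up in dave's
position and has to fetch, reapply and retry.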

Third, this is kind of related to the above two - if you use this  
technique for caching various forms of data, you have no inbuilt way  
of checking if that data has changed. Say you want to cache the  
username of all users who made this photo a favourite. The user then  
changes their username. Your cache is out of date. You'd have to write  
something to go and look at every single one of these caches and  
update it with the new username. Again this seems to be a bad way of  
doing it and requires too much "intelligence".

Fourth is kind of a philosophical problem which many may not agree  
with, and it may be the RDBMS devil on my shoulder speaking, but to me  
the membership (tag relationship, follower relationship, whatever) is  
a discrete piece of data and should have its own document. Having this  
discrete information existing solely in the metadata of other records  
kind of bothers me. This kind of ties in with the other points as well  
- I want the relationships to be rebuildable. If the arrays on both  
sides become inconsistent, there should be some way of regenerating  
them.

So for these reasons I think that just storing the array on both sides
is a bad idea. From thinking about this I keep coming back to the
"membership" doc as being a necessity, with a few improvements on the
previous implementation.

The approach I have settled on (for now) is that you do create the
Membership document, and then you cache all the information you
need in it - including the revs of the two other objects it refers to.
So you might have TagMembership, and it includes the Photo name, photo
ID and photo rev, plus the same three fields for the Tag.
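
As a rough idea of the shape, a TagMembership could be built like this. The
field names, the "type" field and the composite _id are my own assumptions for
the sketch, not anything CouchDB requires.

```python
# Hypothetical document shape; every field name here is illustrative.
def make_tag_membership(photo, tag):
    """Build one membership doc caching id, rev and name of both sides."""
    return {
        # Assumption: a composite _id so each photo/tag pair has one doc.
        "_id": "tagmembership:" + photo["_id"] + ":" + tag["_id"],
        "type": "TagMembership",
        "photo_id": photo["_id"],
        "photo_rev": photo["_rev"],
        "photo_name": photo["name"],
        "tag_id": tag["_id"],
        "tag_rev": tag["_rev"],
        "tag_name": tag["name"],
    }


photo = {"_id": "photo-1", "_rev": "3-abc", "name": "Sunset at the pier"}
tag = {"_id": "tag-7", "_rev": "1-def", "name": "sunset"}

membership = make_tag_membership(photo, tag)
print(membership)
```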

It gains you:

- a "canonical" document - the data is not "multi-master" and is  
easily replicable
- because you have the _revs of both remote docs you can detect
stale cached data in an automated "dumb" way - if a user changes their name,
for example, you could grab every membership doc that doesn't have the
new user_rev and bulk update them.
- one relationship, one doc, no contention problems
- you can then grab all the tag names for a specific Photo ID, or all  
the photo names for a specific Tag ID, in a single view
- using the membership doc as a source, you can then go and cache data  
on either side at will if you really want to - the important point  
being that it's rebuildable from the "canonical" doc
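
A sketch of the rev check and the two lookups from the list above, using the
photo side rather than the user side. Plain Python lists stand in for the
database and for a view here; the helper names and fields are assumptions for
illustration, and a real version would be a view plus a bulk update.

```python
# Hypothetical field names; a plain list stands in for the database and views.
photo = {"_id": "photo-1", "_rev": "2-new", "name": "Sunset at the pier"}

memberships = [
    {"photo_id": "photo-1", "photo_rev": "1-old", "photo_name": "Sunset",
     "tag_id": "tag-1", "tag_name": "sunset"},
    {"photo_id": "photo-1", "photo_rev": "2-new",
     "photo_name": "Sunset at the pier",
     "tag_id": "tag-2", "tag_name": "beach"},
    {"photo_id": "photo-2", "photo_rev": "1-x", "photo_name": "Harbour",
     "tag_id": "tag-2", "tag_name": "beach"},
]


def stale_for(photo, docs):
    """Membership docs for this photo whose cached rev is out of date."""
    return [m for m in docs
            if m["photo_id"] == photo["_id"] and m["photo_rev"] != photo["_rev"]]


def refreshed(m, photo):
    """Copy of a membership doc with the photo's current rev and name."""
    return {**m, "photo_rev": photo["_rev"], "photo_name": photo["name"]}


def tag_names_for_photo(docs, photo_id):
    return sorted(m["tag_name"] for m in docs if m["photo_id"] == photo_id)


def photo_names_for_tag(docs, tag_id):
    return sorted(m["photo_name"] for m in docs if m["tag_id"] == tag_id)


# The "dumb" repair: find docs with the wrong rev and rewrite them in one go.
stale = stale_for(photo, memberships)
updated = [refreshed(m, photo) if m in stale else m for m in memberships]

print(tag_names_for_photo(updated, "photo-1"))
print(photo_names_for_tag(updated, "tag-2"))
```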

This seems to me to be the best way of doing this for now; I'd like to
hear any arguments or other ideas. If I am not making sense then I can
provide code examples if anyone would like that...

Sho




> I've created a simple example. I used photos and tags instead of
> followers because I find the self referentiality of the follower model
> adds unnecessary confusion, but the underlying concept - a many to
> many relationship - remains.
> http://friendpaste.com/ev5DAJTR
>
> Cheers
>
> Paul

