From: Alexander Lamb
To: couchdb-user@incubator.apache.org
Subject: Re: Relying on revisions for rollbacks
Date: Tue, 18 Mar 2008 09:03:44 +0100

Just so I understand: attaching previous versions as attachments means either:

1) the last version of the document carries a list of attachments, one per previous version, or
2) the last version of the document carries the previous version as an attachment, which itself carries the previous version as an attachment, and so on.

If the answer is (2), then merging updates from several servers might really be difficult!

If the answer is (1), merging is simpler, but it is not very easy to generate a version number, except by using revision dates.

Ultimately, the reasons to keep revisions (in what I am considering using CouchDB for) are:

1) an audit trail (for legal reasons), which means not only "show me who changed what, and when, in document X" but also "show me a set of documents as they were on Jan-3-2008 10:28";
2) different document "statuses": archived (i.e. cannot be changed), published (for global use), published locally, and work in progress (visible only to the user editing it).

Point 2 is important because it means a document can be "live" with several different revisions, and depending on who you are in the system, you get to see one revision or another.

In practice, it should be easy to write views which say, for example: "give me all published documents plus all my work-in-progress documents". Since there could be many published revisions, the query is really "give me the last revision with published status plus the last revision with work-in-progress status".
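To make that concrete, here is a rough sketch of the kind of view I have in mind, written against the HTTP API in Python with the requests library. Everything in it is my own convention rather than anything CouchDB prescribes: each saved version is its own document carrying a doc_id that links the versions, a status field and an updated_at timestamp, and the server address and database name are placeholders.

    import json
    import requests  # any HTTP client would do; requests is just convenient

    COUCH = "http://127.0.0.1:5984"  # placeholder server address
    DB = "docs"                      # placeholder database name

    # A view keyed on [logical id, status, timestamp], so that for a given
    # document and status the most recent version sorts last.
    design = {
        "_id": "_design/versions",
        "language": "javascript",
        "views": {
            "by_status": {
                "map": (
                    "function(doc) {"
                    "  if (doc.doc_id && doc.status) {"
                    "    emit([doc.doc_id, doc.status, doc.updated_at], null);"
                    "  }"
                    "}"
                )
            }
        },
    }
    requests.put(f"{COUCH}/{DB}/_design/versions", json=design)

    # "Give me the latest published version of logical document 'contract-42'":
    params = {
        "startkey": json.dumps(["contract-42", "published", {}]),
        "endkey": json.dumps(["contract-42", "published"]),
        "descending": "true",
        "limit": 1,
        "include_docs": "true",
    }
    latest_published = requests.get(
        f"{COUCH}/{DB}/_design/versions/_view/by_status", params=params
    ).json()["rows"]

A second view keyed on [author, status, updated_at] would answer the "all my work-in-progress documents" half of the query in the same way.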
Then, when I have finished working on my "work in progress" document, I want to store it as "published" and delete all the work-in-progress revisions I created between the last published document and my new version...

In summary, what I am describing here is fairly generic document-management-system behaviour. Do we want this as custom-built code, as an actual part of CouchDB, or as an optional layer on top of CouchDB?

My 2 euro cents :-)

Alex

On 17 Mar 2008, at 20:52, Damien Katz wrote:

> On Mar 17, 2008, at 2:48 PM, Alan Bell wrote:
>
>> Jan Lehnardt wrote:
>>>
>>> You can do that, too. With attachments, you'd have it all in one
>>> place and would not need to write your views in a way that they
>>> don't pick up old revisions. That said, it is certainly possible to
>>> store older revisions in other documents, if that solves your
>>> problems.
>>>
>>> Cheers
>>> Jan
>>> --
>> Well, I might be missing something about the way CouchDB handles
>> attachments, but this doesn't sound good to me. Adding attachments
>> to hold the revision history means that the attachments have to be
>> replicated each time a revision happens.
>
> Right now, this is true. But with attachment-level incremental
> replication, only attachments that have changed will replicate.
>
>> Also, a replication conflict is pretty much the same thing as a
>> revision. A client application would have no knowledge of a
>> replication conflict happening, but this would be good to see in a
>> wiki-like page history. I can imagine that in a distributed system
>> it would be very hard for the clients to maintain a revision history
>> as attachments.
>
> I disagree about the difficulty. It's surprisingly simple
> conceptually.
>
> The first thing is, every time you update the document, simply
> attach the previous revision when you save. Eventually there will be
> a flag you can pass in to do this automatically.
>
> Then, if there is a replication conflict to resolve, simply open the
> two conflicting documents (manually if necessary), update your
> chosen winner with any info you want to preserve from the loser
> (data, revision histories, etc.), then delete the losing revision.
>
> And that's it. The thing about this system is you can get very
> simple or very complicated with the revision history aspects; it's
> up to the application developer. The nice thing is you generally
> don't need to worry about concurrent or distributed updates with
> other nodes attempting the same thing. The same rules still apply,
> and eventually the conflicts will be resolved.
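To make the above concrete, here is a minimal sketch of this attach-the-previous-revision pattern over the HTTP API, again in Python with requests. The attachment naming, the field conventions and the lack of error handling are mine; the ?conflicts=true query parameter and the standalone attachment PUT/DELETE calls are standard parts of the HTTP API in recent CouchDB releases.

    import json
    import requests  # assumed HTTP client

    COUCH = "http://127.0.0.1:5984"  # placeholder server and database
    DB = "docs"

    def save_with_history(doc_id, new_fields):
        """Update a document and attach the outgoing version to the new one."""
        url = f"{COUCH}/{DB}/{doc_id}"
        current = requests.get(url).json()      # the version we are replacing
        old_rev = current["_rev"]

        updated = dict(current, **new_fields)   # carries _id and _rev along
        new_rev = requests.put(url, json=updated).json()["rev"]
        # (A 409 here would mean someone else updated the document in between.)

        # Store the old version as an attachment named after its revision.
        requests.put(
            f"{url}/previous-{old_rev}",
            params={"rev": new_rev},
            data=json.dumps(current),
            headers={"Content-Type": "application/json"},
        )

    def resolve_conflict(doc_id):
        """Keep the current winner and delete the losing conflict revisions."""
        url = f"{COUCH}/{DB}/{doc_id}"
        doc = requests.get(url, params={"conflicts": "true"}).json()
        for losing_rev in doc.get("_conflicts", []):
            # Merge anything worth keeping from the loser first, then:
            requests.delete(url, params={"rev": losing_rev})

Note that the attachment upload bumps the document revision once more, so a real implementation would retry on conflicts rather than assume the two writes always land back to back.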
>> As for writing views to not pick up old revisions, I think all
>> applications should assume that all documents are at all times
>> carrying a bundle of prior versions and replication/save conflicts.
>> One of the nasty things in Notes is that most applications assume
>> that replication conflicts don't happen, and they can break when
>> they do. I think a major feature of CouchDB is sensible handling of
>> revisions and conflicts. Purging revisions and conflicts is going
>> to be necessary for some applications, but in others it is
>> desirable to retain all versions. It would be good at least to be
>> able to specify which databases to run compaction on and which to
>> exclude.
>
> The scheduling of compaction is something that will be external to
> the core database code. Much of the work here isn't in the actual
> file-level compaction code, but in creating tools to monitor things
> and initiate it with the desired options.
>
>> What is the proposed rule for compaction? Just deleting all
>> revisions it finds? Deleting old revisions over a certain age?
>
> For the first cut of compaction, it will unconditionally purge all
> previous revisions of a document from a database, leaving only the
> most recent revisions of the winner and its conflicts.
>
> Then we will provide a way to perform selective purging during
> compaction, probably with a user-provided function that will be fed
> each document at compaction time and will return true or false
> depending on whether the document should be kept or discarded. This
> is also how deletion "stubs" will be purged (keeping some meta
> information about deleted documents is necessary for replication).
>
>> Another thought: it would be nice perhaps to run compaction on some
>> servers but not on others for replicas of the same database. Thus a
>> bunch of offline clients could compact fairly frequently and
>> aggressively, while a central server that they all replicate with,
>> and that has lots of disk space, could retain all versions.
>
> OK, that's a neat use case, but I'm not sure how you would handle the
> intermediate edits replicating back to the server. Maybe they just
> get lost. It seems possible to support such a thing without a lot of
> work. We'll see what is possible.
>
>> I am thinking in particular of the scenario of OLPC XO laptops
>> replicating with a school server.
>>
>> Alan.
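Coming back to the selective-purge function Damien mentions: purely as a sketch of the idea, and not of any API that exists today, the keep/discard predicate for our use case might conceptually look like this (the status values, field names and cut-off date are mine):

    # Hypothetical keep/discard predicate of the kind described above.
    # Compaction would call it once per stored revision and keep the
    # revision only when it returns True.
    def keep_revision(doc, cutoff="2008-01-01T00:00:00Z"):
        status = doc.get("status")
        if status in ("archived", "published"):
            return True                  # audit trail: never purge these
        if doc.get("_deleted"):
            return False                 # drop deletion stubs here too, once
                                         # replication can cope with that
        # Discard old intermediate work-in-progress edits; ISO-8601
        # timestamps compare correctly as plain strings.
        return doc.get("updated_at", "") >= cutoff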