couchdb-dev mailing list archives

From eric casteleijn <eric.castele...@canonical.com>
Subject Re: couch sequence numbers and _all_docs_by_seq
Date Wed, 06 May 2009 08:22:28 GMT
> If I remember correctly your problem, you want to process documents
> when they first cross the transom into your cloud. If they are then
> replicated out and about and come back again later, you don't need to
> reprocess them. Of course anything involving client-changeable data is
> not a 100% guarantee, but if you can live with occasionally
> reprocessing documents, then you might try something like:
> 
> A view of all docs which do not have the cloud_processed property. And
> then a process which is always trying to keep that view empty by
> processing the docs it lists in whatever manner you need.

This is exactly what we do now: an update trigger queries for
unprocessed documents and bulk updates them with a 'first_seen' field
that holds the sequence number the database was at when the trigger
fired.
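
A rough sketch of the core step of such a trigger, for illustration only. The function name stamp_first_seen and the sample documents are hypothetical, and the real trigger may well differ; the only CouchDB-specific assumption is that the result is sent as the {"docs": [...]} body of a _bulk_docs request.

```python
def stamp_first_seen(docs, update_seq):
    """Return copies of the docs that lack 'first_seen', stamped with update_seq.

    docs       -- documents as plain dicts (hypothetical sample shape)
    update_seq -- the database sequence number read when the trigger fired
    """
    stamped = []
    for doc in docs:
        if "first_seen" not in doc:
            updated = dict(doc)  # copy, so the input documents stay untouched
            updated["first_seen"] = update_seq
            stamped.append(updated)
    return stamped


# Hypothetical documents as they might come back from the "unprocessed" query.
docs = [
    {"_id": "a", "_rev": "1-x"},
    {"_id": "b", "_rev": "2-y", "first_seen": 7},
]

# Body for a single bulk update; only the unstamped document is included.
payload = {"docs": stamp_first_seen(docs, update_seq=12)}
```

Documents that already carry 'first_seen' are skipped, so a document that is replicated out and comes back later keeps its original stamp.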

The reason I would like access to the sequence number of documents in
my views is similar but different. It would let me write a view that
gets all the documents of a particular type that were last updated
between two sequence numbers, without relying on an id prefix. The
prefix approach feels awkward and is problematic for us: we use URIs
for document types, which obviously won't work as part of the id, so
we'd have to keep a mapping from type to prefix, and that is another
step away from simplicity.

> Of course you'll be relying on clients to trigger "download" (from the
> cloud to their local) replication about as often as they trigger
> "upload" replication, otherwise your process will start to stack up
> docs in a conflict state.
> 
> The other solution I think we talked about was maintaining an
> independent database in the cloud, which just tracks which
> document-ids have been processed. This avoids the conflicts scenario,
> and when you think about what it means to the disks, it's about the
> same cost as maintaining that view. However, you end up querying it
> over and over again for each document you see, instead of just seeing
> the relevant docs.

That is still a solution we might have to choose, but even if it's not a
performance problem, it increases code complexity.

> I'd do whatever possible to avoid recording update_seq at the
> application level, as CouchDB is not designed to make multi-node
> guarantees about that property.

Yes, that way madness lies, and I'm not suggesting that. All I'd like is 
for views I create myself to be able to use _seq as a key (or possibly 
value) like _all_docs_by_seq does, to have more efficient ways of 
querying for data that changed either within the node or through 
replication, i.e. with a single view, rather than through calling 
_all_docs_by_seq and filtering in application code.
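
For comparison, the application-side filtering mentioned above might look roughly like this. It is a sketch under assumptions: each row is taken to carry its sequence number in "key" and, when fetched with include_docs=true, the document in "doc"; the "type" field name, the example type URI and the half-open range (since, until] are all hypothetical choices.

```python
def changed_of_type(rows, doc_type, since, until):
    """Return docs of doc_type whose sequence number lies in (since, until].

    rows -- rows in the assumed by-sequence shape: {"id", "key", "doc"}
    """
    matches = []
    for row in rows:
        doc = row.get("doc") or {}  # the doc may be absent or null
        if since < row["key"] <= until and doc.get("type") == doc_type:
            matches.append(doc)
    return matches


NOTE = "http://example.com/types/note"  # hypothetical type URI

# Hypothetical rows, already ordered by sequence number.
rows = [
    {"id": "a", "key": 3, "doc": {"_id": "a", "type": NOTE}},
    {"id": "b", "key": 5, "doc": {"_id": "b", "type": "http://example.com/types/contact"}},
    {"id": "c", "key": 8, "doc": {"_id": "c", "type": NOTE}},
    {"id": "d", "key": 9, "doc": None},
]

recent_notes = changed_of_type(rows, NOTE, since=3, until=9)
```

Every row in the range is fetched and inspected even though only one type is wanted, which is the cost a single view keyed on type and sequence number would avoid.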

-- 
- eric casteleijn
http://www.canonical.com
