couchdb-user mailing list archives

From Randall Leeds <randall.le...@gmail.com>
Subject Re: Reduce function to perform a Union?
Date Mon, 23 May 2011 23:25:53 GMT
On Fri, May 20, 2011 at 11:11, Jim Klo <jim.klo@sri.com> wrote:

> I was trying to not have to perform the merge outside of couch. Ultimately
> the problem we are trying to solve is a problem related to supporting the
> implementation of time based range queries and flow control over a
> contiguous range for OAI-PMH harvest support which has a requirement for
> consistency in range requests.
>
> Let me describe our application a bit. The attached image illustrates the
> system design, where the light blue circles are instances of the application
> services. Each node is connected to each other in an ad-hoc manner.  Some
> nodes may communicate bi-directionally, some may not. Our aim is to use
> filtered replication via Couch to distribute documents. At each node CouchDB
> is used to store a JSON document. Since the system is constantly receiving
> new content, storing a timestamp with the document is pointless, when
> considering that there's no way to guarantee the consistency of the global
> content set at any single point in time.  Updating the JSON doc with a new
> timestamp at insert/update (replication) would just cause the document to
> replicate again - causing a cascade effect. Currently all documents inserted
> into Couch are considered immutable in our design.
>
>
<snip big image>


>
> Consider timestamped documents being replicated to a node from other
> nodes, using the date for simplicity, as in the attached image. Look at the
> replication timeline on a single node, where the content can change at
> any time.
>
> If at 3:00 I query Node X for all the Orange documents between 4/1 and 4/4,
> I get one object back.  The consistency of the result changes at 4:00 if I
> query Node X again for all Orange documents between 4/1 and 4/4. OAI-PMH
> protocol requires that the result remains consistent.
> Ideally, what I need is to be able to execute a query at 3:00 and at 6:00
> where I ask for all the Orange documents received between 1:00 and 3:00.
> Regardless of when I ask, I can always get the stream of documents received
> within a specific time slice, matching some specific criteria.
>
> Effectively what I need is to be able to create a view using
> the local sequence of the document together with other document traits
> (keywords, publisher, etc).  If there's a way to do this, then my problem is
> solved.  I haven't found this ability in couch, so what we've done is,
> on replication/publish, have a change listener insert a 'sidecar
> document' containing a local timestamp (sequence), which we can use with
> document linking or view collation w/ a reduce to figure out the right
> result set with a minimal number of transforms before it is returned to
> the consumer.
>
>

> Using a document linking method like:
>
> function(doc) {
>   if (doc.doc_type == "resource_data_timestamp" && doc.node_timestamp) {
>     emit(doc.node_timestamp, { "_id": doc.resource_doc_id } );
>   }
> }
>
> where the resource_data_timestamp doc looks something like:
>
> {
>    "_id": "d2018b3fe169426b95e44a5580692d5a-timestamp",
>    "_rev": "1-719655db4a1df9b9efcc5edbd62289ed",
>    "doc_type": "resource_data_timestamp",
>    "resource_doc_id": "d2018b3fe169426b95e44a5580692d5a",
>    "doc_version": "0.20.0",
>    "node_timestamp": "2011-05-19T22:09:55.704004Z"
> }
>
>
> we can perform a range query against the view by timestamp only and get the
> original doc via include_docs=true, but would then have to table scan to
> filter out documents that aren't "orange", which could be millions of
> records in our use case. This method, though, only lets me join 2 documents
> together.... We're also trying to determine a method of handling a delete
> policy, potentially using a tombstone document as well - which is where the
> collated key with map and reduce come into play, since there could be more
> than one type of "sidecar" document.
>
> Is this making sense?  Really we're looking for any solution that lets us
> ensure a consistent range result with some additional filtering, that's
> relatively practical, and that most importantly can scale.
>
> Any advice would be great.
>

Thanks for the clear explanation. I'd love to try to help you out here.

In general I'm totally a fan of the immutable document/"sidecar" approach.
I can't think of a good way to solve all your problems easily yet, but I'll
keep ruminating. Anyway, this is what makes it all fun and worthwhile,
right?

Listening on /_changes and inserting a timestamp document in response to
receiving replicated documents gives you, like you said, an index with
the consistency guarantees you want.
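
A rough sketch of the listener's core logic, in case it helps. It only
covers the pure parts (deciding whether a change row needs a sidecar, and
building the sidecar), not the HTTP plumbing. The helper names makeSidecar
and needsSidecar are made up for illustration; the field names and the
"-timestamp" id suffix follow the resource_data_timestamp example quoted
above, and the row shape assumes /_changes is read with include_docs=true.

```javascript
// Hypothetical helper: build the sidecar for one received document.
// Field names follow the resource_data_timestamp example in this thread.
function makeSidecar(doc, now) {
  return {
    _id: doc._id + "-timestamp",
    doc_type: "resource_data_timestamp",
    resource_doc_id: doc._id,
    // Node-local receive time, taken from the listener's clock.
    node_timestamp: now.toISOString()
  };
}

// Hypothetical helper: decide whether a /_changes row (include_docs=true)
// needs a sidecar. Skips deletions, design docs, and sidecars themselves,
// so the listener does not react to its own writes.
function needsSidecar(change) {
  var doc = change.doc;
  if (!doc || change.deleted) return false;
  if (doc._id.indexOf("_design/") === 0) return false;
  return doc.doc_type !== "resource_data_timestamp";
}
```

The listener would call needsSidecar for each row and PUT the result of
makeSidecar(change.doc, new Date()) when it returns true.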

Providing you're only searching for one "color" at a time (and not trying to
do arbitrary intersection/union of tags or something), geocouch might give
you the queries you need.

For example:
Listen on /_changes and when you receive a document (from replication or
user write), PUT a sidecar document into the database with the color field
from the original document.
Then, using geocouch, build a spatial map that emits [timestamp, color] as
point geometry for these documents.
Finding a time range of orange documents becomes a bounding box query in
this setup.
Naturally, take care to use filtering so sidecar documents don't replicate.
Again, unions of colors won't work like this, so maybe that's a non-starter.
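
Roughly what I have in mind, as an untested sketch. Assumptions: the
sidecar carries a "color" field, colors are mapped to numbers through a
made-up COLOR_CODES table (point coordinates have to be numeric), and the
JavaScript engine can parse ISO 8601 strings with Date.parse (if yours
can't, store a numeric timestamp in the sidecar instead). The names
sidecarPoint, bboxFor and noSidecars are mine; in a design doc these would
be anonymous functions under "spatial" and "filters", with the color table
inlined in the spatial function. The emit stub is only there so the sketch
runs outside CouchDB.

```javascript
// Stand-in for the emit() the view server provides.
var rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

// Hypothetical mapping from color to a numeric y coordinate.
var COLOR_CODES = { orange: 1, green: 2, blue: 3 };

// Spatial function: one point per sidecar, x = receive time in
// milliseconds, y = color code.
function sidecarPoint(doc) {
  if (doc.doc_type !== "resource_data_timestamp") return;
  var code = COLOR_CODES[doc.color];
  var t = Date.parse(doc.node_timestamp);
  if (code === undefined || isNaN(t)) return;
  emit({ type: "Point", coordinates: [t, code] }, doc.resource_doc_id);
}

// Build the bbox parameter (west,south,east,north) for one color and a
// time range. The box is padded by 0.5 around the color code so it covers
// exactly one color row.
function bboxFor(color, startIso, endIso) {
  var code = COLOR_CODES[color];
  return [Date.parse(startIso), code - 0.5,
          Date.parse(endIso), code + 0.5].join(",");
}

// Replication filter: let everything through except sidecar documents.
function noSidecars(doc, req) {
  return doc.doc_type !== "resource_data_timestamp";
}
```

The string from bboxFor("orange", start, end) would go in the bbox query
parameter of the spatial view request.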

Sound like this is heading in the right direction?

While I know it's generally advised to make views deterministic, and I might
get beaten with a stick for saying this, you *could* generate the node-local
timestamp in the map function using the current time...
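
Something like the sketch below, with heavy caveats. It assumes the
original documents carry a "color" field, and the name
byColorAndIndexTime is made up. The timestamp is the moment the view
indexer processed the document, not the moment it arrived, and every
timestamp is regenerated if the view is ever rebuilt, which would break
the consistency you're after. The emit stub is only there so it runs
outside CouchDB.

```javascript
// Stand-in for the emit() the view server provides.
var rows = [];
function emit(key, value) { rows.push({ key: key, value: value }); }

// Deliberately non-deterministic map function: the key includes the
// indexer's current time in milliseconds.
function byColorAndIndexTime(doc) {
  if (doc.color) {
    emit([doc.color, new Date().getTime()], null);
  }
}
```

With keys shaped like [color, time], a range for one color would be
startkey=["orange", t0] and endkey=["orange", t1].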

Anyway, I'm super interested in your use case and I'd like to help you solve
this, so keep me in the loop!

Regards,
Randall
