incubator-couchdb-user mailing list archives

From Jim Klo <jim....@sri.com>
Subject Re: Reduce function to perform a Union?
Date Tue, 24 May 2011 06:42:00 GMT
Response below:


On May 23, 2011, at 4:25 PM, Randall Leeds <randall.leeds@gmail.com> wrote:

> On Fri, May 20, 2011 at 11:11, Jim Klo <jim.klo@sri.com> wrote:
> I was trying to not have to perform the merge outside of couch. Ultimately the problem
we are trying to solve is a problem related to supporting the implementation of time based
range queries and flow control over a contiguous range for OAI-PMH harvest support which has
a requirement for consistency in range requests.
> 
> Let me describe our application a bit. The attached image illustrates the system design,
where the light blue circles are instances of the application services. Each node is connected
to each other in an ad-hoc manner.  Some nodes may communicate bi-directionally, some may
not. Our aim is to use filtered replication via Couch to distribute documents. At each node
CouchDB is used to store a JSON document. Since the system is constantly receiving new content,
storing a timestamp with the document is pointless, when considering that there's no way to
guarantee the consistency of the global content set at any single point in time.  Updating
the JSON doc with a new timestamp at insert/update (replication) would just cause the document
to replicate again - causing a cascade effect. Currently all documents inserted into Couch
are considered immutable in our design.
> 
> 
> <snip big image>
>  
> 
> Consider timestamped documents being replicated to a node from another node, using the
date for simplicity, as in the attached image. Look at the replication timeline on a single
node, where the content can change at any time.
> 
> If at 3:00 I query Node X for all the Orange documents between 4/1 and 4/4, I get one
object back.  If I query Node X again at 4:00 for all Orange documents between 4/1 and 4/4,
the result is different. The OAI-PMH protocol requires that the result remain consistent.
> <Dock-3.jpg>
> Ideally what I need is to be able to execute a query at 3:00 and again at 6:00 that asks for
all the Orange documents received between 1:00 and 3:00. Regardless of when I ask, I can always
get the stream of documents received during a specific time slice, matching some specific
criteria.
> 
> Effectively what I need is to be able to create a view using the local
sequence of the document together with other document traits (keywords, publisher, etc).  If there's
a way to do this, then my problem is solved.  I haven't found this ability in couch,
hence what we've done is essentially this: on replication/publish, a change listener
inserts a 'sidecar document' containing a local timestamp (sequence), which we can use with
document linking or view collation w/ a reduce to figure out the right result set with a minimal
number of transforms before it is eventually returned to the consumer.
>  
> 
> Using a document linking method like:
> 
> function(doc) {
> 	if (doc.doc_type == "resource_data_timestamp" && doc.node_timestamp) {
> 		emit(doc.node_timestamp, { "_id": doc.resource_doc_id } );
> 	}
> }
> 
> where the resource_data_timestamp doc looks something like:
> 
> {
>    "_id": "d2018b3fe169426b95e44a5580692d5a-timestamp",
>    "_rev": "1-719655db4a1df9b9efcc5edbd62289ed",
>    "doc_type": "resource_data_timestamp",
>    "resource_doc_id": "d2018b3fe169426b95e44a5580692d5a",
>    "doc_version": "0.20.0",
>    "node_timestamp": "2011-05-19T22:09:55.704004Z"
> }
> 
> we can perform a range query against the view by timestamp only and get the original
doc via include_docs=true, but would then have to table scan to filter out documents that
aren't "orange", which could be millions of records in our use case. This method, though, only
lets me join 2 documents together. We're also trying to determine a method of handling
a delete policy, potentially using a tombstone document as well - which is where the collated
key with map and reduce come into play, since there could be more than one type of "sidecar"
document.
> 
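
A rough sketch of the range query described above, and of the client-side filtering it forces. `startkey`, `endkey`, and `include_docs` are standard CouchDB view parameters; the helper names `timestampRangeQuery` and `filterOrange`, the `color` field, and the sample rows are made up for illustration.

```javascript
// Hypothetical helper: builds the query string for a timestamp-keyed view.
// View keys must be JSON-encoded, so string keys keep their quotes.
function timestampRangeQuery(start, end) {
  const enc = (v) => encodeURIComponent(JSON.stringify(v));
  return "startkey=" + enc(start) + "&endkey=" + enc(end) + "&include_docs=true";
}

// The view is keyed only on the timestamp, so every row in the range comes
// back and any filtering on document traits happens on the client.
// `color` is an assumed field on the linked resource document.
function filterOrange(rows) {
  return rows.filter((row) => row.doc && row.doc.color === "orange");
}

const query = timestampRangeQuery("2011-05-19T00:00:00Z", "2011-05-20T00:00:00Z");

// Rows shaped like a view response with include_docs=true (sample data only).
const sampleRows = [
  { key: "2011-05-19T22:09:55.704004Z", value: { _id: "a" }, doc: { _id: "a", color: "orange" } },
  { key: "2011-05-19T23:00:00.000000Z", value: { _id: "b" }, doc: { _id: "b", color: "blue" } },
];
const orangeRows = filterOrange(sampleRows);
```

The filter runs over every row in the time range, which is the scan the message is trying to avoid.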
> Is this making sense?  Really we're looking for any solution that can ensure a consistent
range result with some additional filtering, that's relatively practical, and that most importantly
can scale.
> 
> Any advice would be great.
> 
> Thanks for the clear explanation. I'd love to try to help you out here.
> 
> In general I'm totally a fan of the immutable document/"sidecar" approach.
> I can't think of a good way to solve all your problems easily yet, but I'll keep ruminating.
Anyway, this is what makes it all fun and worthwhile, right?
> 
> Listening on /_changes and inserting a timestamp document in response to receiving replicated
documents lets you, like you said, have an index with the consistency guarantees you want.
> 
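
A minimal sketch of the sidecar step, assuming the listener receives rows in the usual `_changes` shape (`seq`, `id`, `changes`, optional `deleted`). The helper name `buildSidecar` is hypothetical, and the `-timestamp` suffix check is inferred from the example `_id` earlier in the thread; `doc_version` is left out.

```javascript
// Hypothetical helper: given one row from the _changes feed and the node's
// current time, build the sidecar document to store alongside the original.
// Field names follow the resource_data_timestamp example earlier in the thread.
function buildSidecar(change, nowIso) {
  // Skip deletions, design documents, and sidecars themselves; otherwise each
  // sidecar write would show up in _changes and trigger another sidecar.
  if (change.deleted) return null;
  if (change.id.startsWith("_design/")) return null;
  if (change.id.endsWith("-timestamp")) return null;
  return {
    _id: change.id + "-timestamp",
    doc_type: "resource_data_timestamp",
    resource_doc_id: change.id,
    node_timestamp: nowIso,
  };
}

const sidecar = buildSidecar(
  { seq: 42, id: "d2018b3fe169426b95e44a5580692d5a", changes: [{ rev: "1-abc" }] },
  "2011-05-19T22:09:55.704Z"
);
const skipped = buildSidecar(
  { seq: 43, id: "d2018b3fe169426b95e44a5580692d5a-timestamp", changes: [{ rev: "1-def" }] },
  "2011-05-19T22:09:56.000Z"
);
```

Writing the returned document back to the database, and filtering it out of replication, is left to the listener.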
> Providing you're only searching for one "color" at a time (and not trying to do arbitrary
intersection/union of tags or something), geocouch might give you the queries you need.
> 
> For example:
> Listen on /_changes and when you receive a document (from replication or user write),
PUT a sidecar document into the database with the color field from the original document.
> Then, using geocouch, build a spatial map that emits [timestamp, color] as point geometry
for these documents.
> Finding a time range of orange documents becomes a bounding box query in this setup.
> Naturally, take care to use filtering so sidecar documents don't replicate.
> Again, unions of colors won't work like this, so maybe that's a non-starter.
> 
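
A sketch of what the spatial function suggested above might look like, assuming GeoCouch spatial functions emit a GeoJSON geometry as the first argument to `emit()` and are queried with a `bbox` parameter, as in the GeoCouch README. The sidecar fields `node_epoch_ms` and `color_code` and the helper `bboxFor` are assumptions, not part of the original thread; both coordinates have to be numeric for this to work.

```javascript
// Stand-in for the emit() that the view server provides, so the sketch
// can run outside CouchDB.
const emitted = [];
function emit(key, value) { emitted.push({ key, value }); }

// Assumed sidecar fields (not in the original thread):
//   node_epoch_ms - node-local receive time as a number (x axis)
//   color_code    - integer code for the color (y axis)
function spatialMap(doc) {
  if (doc.doc_type === "resource_data_timestamp" &&
      typeof doc.node_epoch_ms === "number" &&
      typeof doc.color_code === "number") {
    emit({ type: "Point", coordinates: [doc.node_epoch_ms, doc.color_code] },
         { _id: doc.resource_doc_id });
  }
}

// Bounding box for "one color between two times", in x1,y1,x2,y2 order.
// The box is padded by 0.5 on the color axis so that, with integer codes,
// it covers exactly one color.
function bboxFor(startMs, endMs, colorCode) {
  return [startMs, colorCode - 0.5, endMs, colorCode + 0.5].join(",");
}

spatialMap({
  doc_type: "resource_data_timestamp",
  resource_doc_id: "d2018b3fe169426b95e44a5580692d5a",
  node_epoch_ms: 1305842995704,
  color_code: 3,
});
const bbox = bboxFor(1305842400000, 1305846000000, 3);
```

A box like this covers a single color code, so a union of colors would need one query per color.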

I had considered the geocouch approach, but wasn't sure if I could model it that way. I assumed
you had to use point geometry directly. I'm only using color as an example (which can be
plotted numerically). In reality the filtering key is most likely a crowdsourced value like
a keyword or schema name.

That still doesn't solve the duplication of fields within the "sidecar" doc. Eventually, as things
progress, I can see the sidecar doc becoming a complete copy + timestamp, which wastes
space.


> Sound like this is heading in the right direction?
> 
> While I know it's generally advised to make views deterministic, and I might get beaten
with a stick for saying this, you *could* generate the node-local timestamp in the map function
using the current time...
> 

I might be the one that does that! ;-) I'm not sure that works reliably though - I'm assuming
you mean just requesting the view on the _change event, and using something like new Date().toString()?
It would work as long as the view never gets rebuilt, which I'm not sure we could guarantee
long term. 
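
To make the rebuild concern concrete, here is a small simulation run outside CouchDB. The clock is injected so the effect is repeatable; in a real map function it would be `new Date()`. The names `makeMap` and `buildIndex` are illustrative only, and `toISOString()` is used instead of `toString()` so the keys sort chronologically.

```javascript
// Builds a map-style function whose emitted key is "now" according to the
// supplied clock.
function makeMap(clock, emit) {
  return function (doc) {
    if (doc.color) {
      emit(clock().toISOString(), { _id: doc._id });
    }
  };
}

// Simulates building a view index: run the map function over every document.
function buildIndex(docs, clock) {
  const rows = [];
  const map = makeMap(clock, (key, value) => rows.push({ key, value }));
  docs.forEach((doc) => map(doc));
  return rows;
}

const docs = [{ _id: "a", color: "orange" }];

// The first build happens when the document arrives...
const firstBuild = buildIndex(docs, () => new Date("2011-05-24T03:00:00Z"));
// ...and a rebuild (for example after the design document changes) happens later.
const rebuilt = buildIndex(docs, () => new Date("2011-05-24T06:00:00Z"));
```

The same document ends up under a different key after the rebuild, so a time-slice query that matched it before no longer does.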

> Anyway, I'm super interested in your use case and I'd like to help you solve this, so
keep me in the loop!

Thanks! I'm curious how many others have a similar use case. It doesn't seem like it should
be uncommon, but maybe the only users of replicated couchdb data so far have no flow consistency
needs. (For those who might think I'm deranged and confused: that is different from the eventual
consistency model replication follows.) I keep wondering how tough it would really be
to expose the local sequence value to a view. Technically it seems like it's already being
done with _changes somehow; it just needs to be accessible from map/reduce/list/show.  Having
this sort of feature seems like it would open couch up to a whole other class of
applications.
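
For what it's worth, CouchDB's design document reference lists a `local_seq` view option that exposes the local sequence to map functions as `doc._local_seq`. The sketch below assumes that option is available in the version in use, which is worth verifying; the design document name `harvest`, the view name `by_color_seq`, and the `color` field are made up.

```javascript
// Stand-in for the emit() that the view server provides, so the sketch
// can run outside CouchDB.
const emitted = [];
function emit(key, value) { emitted.push({ key, value }); }

// Map function keyed on [trait, local sequence]. doc._local_seq is assumed
// to be populated only when the design document sets options.local_seq.
function bySeq(doc) {
  if (doc.color && doc._local_seq !== undefined) {
    emit([doc.color, doc._local_seq], null);
  }
}

// Hypothetical design document carrying the map function as a string.
const designDoc = {
  _id: "_design/harvest",
  options: { local_seq: true },
  views: { by_color_seq: { map: bySeq.toString() } },
};

// Simulated call, with the sequence value the view server would supply.
bySeq({ _id: "a", color: "orange", _local_seq: 17 });
```

With a key like this, a range from `["orange", 10]` to `["orange", 20]` would select one trait over a slice of the node's own sequence, without a sidecar document.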

> 
> Regards,
> Randall
