couchdb-user mailing list archives

From Brian Candler <B.Cand...@pobox.com>
Subject Re: getting unique set of document id's
Date Tue, 07 Jul 2009 21:01:09 GMT
On Tue, Jul 07, 2009 at 01:08:14PM -0500, Ross Bates wrote:
>    Thank you for the follow up Brian. After looking at your examples I
>    think I understand where I wasn't clear in how I planned to use the
>    POST {"keys": ["foo", "bar"]} statement.
>    What you and Paul suggested is the fastest route to getting documents
>    that are tagged with both foo AND bar, but for my search I am hoping to
>    get foo AND/OR bar.

OK. Then POST {"keys": ["foo", "bar"]} will get you all the documents, and
you can unique them in the client.
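
As a rough illustration of the request described above, here is a minimal Python sketch. The URL, database name, design document name and view name are all hypothetical, and the helper names are invented for this example. It only assumes that the view accepts a POST whose JSON body is {"keys": [...]}, as discussed in this thread.

```python
import json
import urllib.request

# Hypothetical URL: database "mydb", design doc "tags", view "by_tag".
EXAMPLE_URL = "http://localhost:5984/mydb/_design/tags/_view/by_tag"

def build_keys_request(view_url, keys):
    """Build (but do not send) a POST request asking a view for several keys."""
    body = json.dumps({"keys": list(keys)}).encode("utf-8")
    return urllib.request.Request(
        view_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def fetch_rows(view_url, keys):
    """Send the request and return the "rows" list from the JSON response."""
    with urllib.request.urlopen(build_keys_request(view_url, keys)) as resp:
        return json.load(resp)["rows"]
```

Calling fetch_rows(EXAMPLE_URL, ["foo", "bar"]) against a running server should return one row per (key, document) match, so a document tagged with both keys shows up twice.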

>    The reason I was including the doc id in the emit was that I was trying
>    to play with the reduce to see if I could get it to "reduce" it to a
>    unique set.

In fact the docid is already passed to the reduce function; the "keys" array
is actually [key,docid] pairs. But a reduce function is the wrong place to
put this. If you have 1,000,000 documents then the root reduce node would
contain 1,000,000 docids, and that would perform really, really badly. (Note
that each Btree node contains the reduced value of all the docs in that
node, and hence the root Btree node contains the reduced value across all
docs in your database.)
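
To make the cost concrete, here is a toy Python simulation of the point above. It is not CouchDB code: the tree shape, the fanout of 4 and the function names are assumptions made purely for illustration. The reduce collects doc ids from the [key, docid] pairs, and each parent node combines the values of its children.

```python
def reduce_ids(keys, values, rereduce):
    """Anti-pattern: a reduce that returns the set of all doc ids below it."""
    if rereduce:
        # values are the sets already produced by the child nodes
        return set().union(*values)
    # keys are [key, docid] pairs
    return {docid for _key, docid in keys}

def root_value(rows, fanout=4):
    """Fold a non-empty list of rows bottom-up through a toy tree."""
    level = [reduce_ids(rows[i:i + fanout], None, False)
             for i in range(0, len(rows), fanout)]
    while len(level) > 1:
        level = [reduce_ids(None, level[i:i + fanout], True)
                 for i in range(0, len(level), fanout)]
    return level[0]

# 1000 distinct documents, all emitting the same key.
rows = [["foo", "doc%04d" % n] for n in range(1000)]
print(len(root_value(rows)))  # the root holds one id per document
```

In this toy model the value stored at the root grows linearly with the number of documents, which is why this kind of reduce scales so poorly.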

>    I can use just the map() function then make the result unique on the
>    client, but from a performance standpoint I imagine the couchdb view
>    already knows what those unique id's are. I could be totally wrong
>    there though.

It does know the ids, but the view will emit in order of keys. CouchDB would
need to build up a list of unique IDs while traversing the tree, and it
doesn't do that. It just returns them as it sees them.

So it will walk the Btree index for all keys="foo", then walk the Btree
index for all keys="bar", and emit the docids.

You can use a Hash or similar data structure on the client to mark doc ids
seen. Actually, since they will be emitted in increasing docid order (for
equal keys), you could perform a merge instead.
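
Both client-side options can be sketched in a few lines of Python. This assumes the rows arrive as dicts with "id" and "key" fields, grouped by key and in increasing id order within each key, and that ids compare the same way as Python strings (true for plain ASCII ids). The function names and sample data are invented for the example.

```python
import heapq
from itertools import groupby

def unique_ids_hash(rows):
    """Return doc ids in first-seen order, using a set to mark ids seen."""
    seen = set()
    out = []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            out.append(row["id"])
    return out

def unique_ids_merge(rows):
    """Merge the per-key runs (each sorted by id) and drop adjacent repeats."""
    runs = [[r["id"] for r in group]
            for _key, group in groupby(rows, key=lambda r: r["key"])]
    merged = heapq.merge(*runs)
    return [doc_id for doc_id, _ in groupby(merged)]

# Made-up rows for keys "foo" then "bar"; doc "c" matches both.
rows = [
    {"id": "a", "key": "foo", "value": None},
    {"id": "c", "key": "foo", "value": None},
    {"id": "d", "key": "foo", "value": None},
    {"id": "b", "key": "bar", "value": None},
    {"id": "c", "key": "bar", "value": None},
    {"id": "e", "key": "bar", "value": None},
]
```

With this sample, unique_ids_hash(rows) gives ['a', 'c', 'd', 'b', 'e'] and unique_ids_merge(rows) gives ['a', 'b', 'c', 'd', 'e']. The merge only needs to remember the previous id rather than every id seen, but it relies on the ordering assumption holding.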

>    The original foo/bar example is simple, but extending it out would
>    prove to be powerful for set based analysis (database marketing,
>    customer segmentation). To be able to pass {"keys": ["foo",
>    "bar","baz","boo","xyz"]} to a view and get back a set of docs that
>    matched one or more of the keys would be fantastic.

That's pretty much what you get now - with the proviso that the doc may be
returned up to N times if it matches N keys.
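
The repeat count is itself useful: counting how often each doc id appears tells you how many of the requested keys it matched. A small Python sketch follows, assuming rows are dicts with an "id" field and that the map function emits each key at most once per document. The function names and sample rows are made up for illustration.

```python
from collections import Counter

def match_counts(rows):
    """Count how many returned rows each doc id appears in."""
    return Counter(row["id"] for row in rows)

def split_any_all(rows, keys):
    """Return (ids matching at least one key, ids matching every key)."""
    counts = match_counts(rows)
    wanted = len(set(keys))
    any_ids = sorted(counts)
    all_ids = sorted(doc_id for doc_id, n in counts.items() if n == wanted)
    return any_ids, all_ids

# Made-up rows for keys "foo" and "bar"; doc "c" matches both.
rows = [
    {"id": "a", "key": "foo"},
    {"id": "c", "key": "foo"},
    {"id": "b", "key": "bar"},
    {"id": "c", "key": "bar"},
]
any_ids, all_ids = split_any_all(rows, ["foo", "bar"])
```

Here any_ids is ['a', 'b', 'c'] (foo OR bar) and all_ids is ['c'] (foo AND bar), so one query result can serve both kinds of search.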

Regards,

Brian.
