incubator-couchdb-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Brian Candler <B.Cand...@pobox.com>
Subject Re: Querying by foreign keys
Date Sat, 25 Apr 2009 19:40:38 GMT
On Sat, Apr 25, 2009 at 10:27:33AM -0700, Adam Wolff wrote:
> Hi list,
> I have a little query problem that would be easy to solve in SQL, but
> seems hard in CouchDB. I have data that looks like this:
> doc1: {
>    keys : [1,2,3],
>    data : ....
> }
> doc2: {
>    keys : [1,3,5],
>    data : ....
> }
> ...
> 
> I'd like a view of my documents that lets me filter on multiple keys,
> so a query of keys=[1] yields doc1,doc2, as does a query of
> keys=[1,3].

What do you mean by keys=[1,3]? Does it mean all documents which contains
either key 1 or 3, or all documents which contains both keys 1 and 3?

In the first case, you can POST to the view with {"keys":[1,3]}

In the latter case, the simplest answer is to fetch the views for key=1 and
key=3, and intersect them on the client. Actually you can be smarter than
that: fetch whichever result set is smaller (keys=1 or key=3), and then
filter the result set on the client to discard those entries which don't
have the other key.

(This is exactly what a SQL database would do, except it hides it from you
by doing it all server-side)

> In couch, the only way I can think of to truly index all the
> combinations of the keys is to emit the n! combinations of the keys,
> which seems like a bad idea. (Maybe it's not though?)

I think it would be (2^n)-1 rather than n!, since you can sort them first.

e.g. for 5 keys there would be 31 combinations:
1 x []           -- useless, ignore
5 x [a]
10 x [a b]       (a <= b)
10 x [a b c]     (a <= b <= c)
5 x [a b c d]
1 x [a b c d e]

Actually you can halve it again, by realising that if you have emitted
[1,2,3,4,5] then you don't need to emit any prefixes of this, e.g. [1,2,3,4]
or [1,2,3] or [1,2] or [1], because a startkey-endkey query can be used to
find them.

Mind you, exponential growth still ain't good :-) It might be acceptable if
you have a limited number of tags, and you really want an instantaneous
answer to your multi-tag query across millions of documents.

Other tradeoffs are possible. For example, it might make sense just to emit
all single tags plus all combinations of two tags; that would be (n^2+n)/2
combinations. Then your query can instantly find all docs matching two of
the tags requested, and if the user has asked for three or more, filter
those at the client. But you still have to choose which 2 tags to query on,
and it may well not be worth it in practice.

Regards,

Brian.

Mime
View raw message