couchdb-user mailing list archives

From Damien Katz <dam...@apache.org>
Subject Re: Multiple filters on a large data set
Date Fri, 26 Sep 2008 19:37:32 GMT
Your requirements as stated would be well met by something like
Lucene.

However, another possible way to go about this is to permute the key
sets into key arrays and emit each. The number of keys would normally
be (N!)/2, where N is the number of fields you are indexing. However,
we can use view collation to do range lookups, which allows us to
ignore the different array key suffixes. That would reduce the number
of key arrays emitted per document to 2^N. If each document has 10
fields, then the number of key arrays would be 2^10, or 1024 keys
emitted per doc.
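
As a rough illustration of the 2^N scheme (plain Python standing in for
a view map function; the names subset_keys and has_prefix and the sample
document are made up for this sketch), each document emits one key array
per subset of its indexed fields, with the fields always in the same
sorted order:

```python
from itertools import combinations


def subset_keys(doc, fields):
    """Return one key array per subset of the indexed fields (2**N in total).

    Fields are kept in sorted order inside each key array, so a given
    set of filters always maps to exactly one key array. The empty
    array is included; it matches every document.
    """
    present = sorted(f for f in fields if f in doc)
    keys = []
    for size in range(len(present) + 1):
        for combo in combinations(present, size):
            keys.append([[f, doc[f]] for f in combo])
    return keys


def has_prefix(key, prefix):
    """True if the key array starts with the given prefix array."""
    return key[:len(prefix)] == prefix


# Hypothetical sample document with three indexed fields.
doc = {"brand": "acme", "color": "red", "size": "L"}
keys = subset_keys(doc, ["brand", "color", "size"])
print(len(keys))  # 2**3 key arrays
```

In a real view, the prefix test would presumably become a range query,
along the lines of startkey=[["brand","acme"]] and
endkey=[["brand","acme"],{}], relying on objects collating after arrays.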

To build that index for 50,000 documents would take an on-disk view
index of roughly 50,000,000 rows (50,000 x 1024 = 51,200,000).
Building it will take a very long time and it will take a lot of disk
space. But once built, it should then be possible to do categorized,
drill-down searches that can show you relevant sub-categories and
their counts to further narrow down the search, and do so pretty
efficiently. This is very much the kind of thing Endeca does for
online retailers.
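
The size estimate above is just documents times key arrays per
document. A tiny sketch of that arithmetic (index_rows is a made-up
helper name, and it assumes every document carries all N indexed
fields):

```python
def index_rows(num_docs, fields_per_doc):
    """Rows in the view index if every doc emits one key per field subset."""
    return num_docs * 2 ** fields_per_doc


print(index_rows(50_000, 10))  # 51200000
```

Note that each additional indexed field doubles the row count, so this
approach only looks workable for a small number of fields.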

I don't know if CouchDB views are up to it yet, but it might be worth  
experimenting.

-Damien


On Sep 26, 2008, at 2:11 PM, Paul Davis wrote:

>> code. This feels to me like something a database should take care of,
>> and might become problematic when you have your webpage code talk with
>> couchdb directly.
>
> Be very wary of yourself when you think such things. Generally it's a
> sign (at least for me) that you're not realizing how deeply your SQL
> brainwashing runs. And generally when I get to this point if I just
> step back I realize there's probably a decent way to do it with couch.
>
> Though, in this particular case you have come to a somewhat lacking
> area of couch in its ability to handle dynamic queries as such.
>
> And, just a thought, whenever multiget and include_docs lands, you'd
> be able to do this pretty easily as:
>
> get set of documents for tag
> for rest of tags:
>    multiget set of documents where other tag
>
> it'd be an iterative weeding out of docs.
>
> At least, I think that'd work...
>
> Paul
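
The iterative weeding-out Paul describes might look roughly like this.
It is only a sketch: tag_index is a hypothetical in-memory dict (tag ->
set of doc ids) standing in for a by-tag view, and the membership test
stands in for a multiget restricted to the surviving doc ids.

```python
def docs_with_all_tags(tag_index, tags):
    """Return the ids of documents that carry every tag in `tags`."""
    if not tags:
        return set()
    first, *rest = tags
    # Step 1: get the set of documents for the first tag.
    candidates = set(tag_index.get(first, ()))
    # Step 2: for each remaining tag, keep only the candidates that
    # also have that tag (the "multiget" step).
    for tag in rest:
        if not candidates:
            break  # nothing left to weed out
        tagged = tag_index.get(tag, set())
        candidates = {doc_id for doc_id in candidates if doc_id in tagged}
    return candidates


# Hypothetical sample data.
tag_index = {
    "red": {"doc1", "doc2", "doc3"},
    "large": {"doc2", "doc3", "doc4"},
    "cotton": {"doc3", "doc5"},
}
print(docs_with_all_tags(tag_index, ["red", "large", "cotton"]))
```

Starting with the rarest tag keeps the candidate set, and so each
follow-up request, as small as possible.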

