couchdb-user mailing list archives

From Aurélien Bénel
Subject Re: Question on selecting on reduce values
Date Fri, 28 May 2010 17:02:30 GMT
Thanks for your answer,

> It seems that you're using a _list function to filter your view results, right?
> Be aware that even though you're not sending that data to the client, the database still
> has to iterate through all the view rows and send them to the _list function, just to get filtered
> there. So the amount of time it takes to query your view/list will increase proportionally
> with the number of rows returned from the view query.

Yes. This is exactly why I am sceptical about this way of selecting on reduce values.

In our project, we are trying to move our open-source text analysis software from PHP/PostgreSQL
to CouchDB.
The current issue is about finding repeated phrases (sequences of 3 words) in forums.

Each forum thread is stored as a CouchDB "document".

A view emits every sequence that matches a few constraints:

function(doc) {
 const ALPHA = /[a-zàâçéêèëïîôöüùû0-9]+|[^a-zàâçéêèëïîôöüùû0-9]+/gi;
 for each (p in doc.posts) {
   var words = p.text.match(ALPHA);
   for (var i=0; i<words.length-4; i+=2) {
     if (
       (words[i].length>3 || words[i+2].length>3 || words[i+4].length>3)
       && words[i+1].length==1
       && words[i+3].length==1
     ) {
       ], null);
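
The map step above can be sketched as a plain function that runs outside CouchDB. This is only an illustration under stated assumptions: the emit line is truncated in the archived message, so the key used here (an array of the three words) is a guess, and the names `mapPhrases`, `TOKEN` and the sample document are hypothetical. It also assumes each post starts with a word, so that words sit at even token indexes and separators at odd ones.

```javascript
// Sketch of the map step; `emit` is passed in so the function runs outside CouchDB.
const TOKEN = /[a-zàâçéêèëïîôöüùû0-9]+|[^a-zàâçéêèëïîôöüùû0-9]+/gi;

function mapPhrases(doc, emit) {
  for (const post of doc.posts) {
    // Tokens alternate between word runs and separator runs.
    const tokens = post.text.match(TOKEN) || [];
    // Assumption: the text starts with a word, so words are at even indexes.
    for (let i = 0; i < tokens.length - 4; i += 2) {
      const hasLongWord =
        tokens[i].length > 3 || tokens[i + 2].length > 3 || tokens[i + 4].length > 3;
      const singleCharSeparators =
        tokens[i + 1].length === 1 && tokens[i + 3].length === 1;
      if (hasLongWord && singleCharSeparators) {
        // Assumption: the key is the three-word array (the original line is truncated).
        emit([tokens[i], tokens[i + 2], tokens[i + 4]], null);
      }
    }
  }
}

// Hypothetical sample document with a single post.
const emitted = [];
mapPhrases(
  { posts: [{ text: "the quick brown fox" }] },
  (key, value) => emitted.push({ key, value })
);
console.log(JSON.stringify(emitted));
```

With this sample, two overlapping sequences are emitted, one starting at "the" and one starting at "quick".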

Then a reduce counts occurrences across the whole corpus:

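function(keys, values, combine) {
 if (combine) {
   // rereduce pass: values are partial counts, so add them up
   return sum(values);
 } else {
   // first pass: one value per emitted row
   return values.length;
 }
}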
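
To make the two passes concrete, here is a small simulation of how such a counting reduce might be called, first on batches of map rows and then on the partial results. The `sum` helper is a local stand-in for the one CouchDB provides to reduce functions, and `countReduce`, the phrase key and the document ids are hypothetical names used only for illustration.

```javascript
// Local stand-in for the `sum` helper available inside CouchDB reduce functions.
function sum(values) {
  return values.reduce((a, b) => a + b, 0);
}

// Same logic as the reduce above, written as a named function.
function countReduce(keys, values, rereduce) {
  if (rereduce) {
    return sum(values); // values are partial counts
  }
  return values.length; // one value per emitted row
}

// First pass: rows for one phrase, split over two batches.
// On this pass, each entry of `keys` is a [key, docId] pair.
const phrase = ["quick", "brown", "fox"];
const partialA = countReduce([[phrase, "thread1"], [phrase, "thread2"]], [null, null], false);
const partialB = countReduce([[phrase, "thread3"]], [null], false);

// Rereduce pass: keys is null and values holds the partial counts.
const total = countReduce(null, [partialA, partialB], true);
console.log(partialA, partialB, total);
```

The phrase occurs three times in total, whichever way the rows happen to be batched.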

Then a list filters out unrepeated phrases:

function(head, req) {
 var phrase;
 while (phrase = getRow()) {
   if (phrase.value>1) { // is repeated
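
The cost concern can be illustrated with a stand-in for the list step that counts how many rows it has to read. This is a sketch, not the truncated function above: `getRow` and `send` are passed in as parameters to mimic the functions CouchDB gives to a list function, and `listRepeated`, the sample rows and the output format are assumptions.

```javascript
// Stand-in for the list step; returns how many rows were read.
function listRepeated(getRow, send) {
  let scanned = 0;
  let row;
  while ((row = getRow())) {
    scanned += 1;
    if (row.value > 1) { // is repeated
      send(JSON.stringify(row) + "\n");
    }
  }
  return scanned;
}

// Hypothetical reduced rows (one per phrase, value = occurrence count).
const rows = [
  { key: ["brown", "fox", "jumps"], value: 1 },
  { key: ["quick", "brown", "fox"], value: 3 },
  { key: ["the", "lazy", "dog"], value: 2 },
  { key: ["the", "quick", "brown"], value: 1 },
];

let cursor = 0;
const sent = [];
const scanned = listRepeated(
  () => rows[cursor++],       // yields undefined once the rows run out
  (chunk) => sent.push(chunk)
);
console.log(scanned + " rows read, " + sent.length + " rows sent");
```

Every row is read even though only the repeated phrases are sent, which is the per-query cost described in the quoted reply.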

I know that the view could be written differently, and probably more efficiently, with regular
expressions. My worry, though, is not the cost of generating the views the first time (that
was what I meant by "cached"), but the cost paid every time I query the list.

