couchdb-user mailing list archives

From J Chris Anderson <jch...@gmail.com>
Subject Re: Question on selecting on reduce values
Date Fri, 28 May 2010 21:57:55 GMT

On May 28, 2010, at 10:02 AM, Aurélien Bénel wrote:

> Thanks for your answer,
> 
> >> It seems that you're using a _list function to filter your view results, right?
> >> Be aware that even though you're not sending that data to the client, the database
> >> still has to iterate through all the view rows and send them to the _list function,
> >> just to be filtered there. So the time it takes to query your view/list will increase
> >> in proportion to the number of rows returned from the view query.
> 
> Yes. This is indeed why I am sceptical about this way of selecting reduce values.
> 
> In our project, we are trying to move our open-source text analysis software from
> PHP/PostgreSQL to CouchDB.
> The current issue is about getting repeated phrases (sequences of 3 words) in forums.
> 
> Each forum thread is stored as a CouchDB "document".
> 
> A view emits every sequence that matches several constraints:
> 
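> function(doc) {
>   // Split text into alternating runs of word and non-word characters.
>   var ALPHA = /[a-zàâçéêèëïîôöüùû0-9]+|[^a-zàâçéêèëïîôöüùû0-9]+/gi;
>   for (var k in doc.posts) {
>     var p = doc.posts[k];
>     var words = p.text.match(ALPHA) || []; // match() returns null on empty text
>     // Look at tokens i, i+2 and i+4; the tokens between them must be one character long.
>     for (var i = 0; i < words.length - 4; i += 2) {
>       if (
>         (words[i].length > 3 || words[i+2].length > 3 || words[i+4].length > 3)
>         && words[i+1].length == 1
>         && words[i+3].length == 1
>       ) {
>         emit([
>           words[i].toLowerCase(),
>           words[i+2].toLowerCase(),
>           words[i+4].toLowerCase()
>         ], null);
>       }
>     }
>   }
> }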
> 
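To make the constraints above concrete, here is a minimal sketch that applies the same tokenizer and tests to a single string outside CouchDB. The helper name `phrasesOf` and the sample sentence are assumptions for illustration; they are not part of the original view.

```javascript
// Tokenizer regex copied from the map function in the message.
var ALPHA = /[a-zàâçéêèëïîôöüùû0-9]+|[^a-zàâçéêèëïîôöüùû0-9]+/gi;

// Hypothetical helper (not a CouchDB API): returns the keys the map
// function would emit for the text of one post.
function phrasesOf(text) {
  var words = text.match(ALPHA) || [];
  var keys = [];
  for (var i = 0; i < words.length - 4; i += 2) {
    if (
      (words[i].length > 3 || words[i + 2].length > 3 || words[i + 4].length > 3)
      && words[i + 1].length == 1
      && words[i + 3].length == 1
    ) {
      keys.push([
        words[i].toLowerCase(),
        words[i + 2].toLowerCase(),
        words[i + 4].toLowerCase()
      ]);
    }
  }
  return keys;
}

var keys = phrasesOf("The quick brown fox");
console.log(JSON.stringify(keys));
// [["the","quick","brown"],["quick","brown","fox"]]
```

A sentence made only of short words, such as "a b c", yields no keys, because none of the three words is longer than 3 characters.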
> Then a reduce counts occurrences across the whole corpus:
> 
> function(keys, values, combine) {
> if (combine) {
>   return sum(values);
> } else {
>   return values.length;
> }
> }
> 

Try replacing the reduce function with the single-word string "_count" (without the quotes).

This will do it in Erlang, and should speed things up a lot. Please let us know what kind
of difference this makes.
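As a rough illustration of that suggestion, a design document using the built-in reduce might look like the sketch below. The design document name "phrases" and the view name "by_phrase" are assumptions made up for this example, and the map source is abbreviated.

```javascript
// Hypothetical design document: the names "phrases" and "by_phrase" are
// illustrative and do not come from the message.
var designDoc = {
  _id: "_design/phrases",
  views: {
    by_phrase: {
      // The map function from the message goes here, stored as a string.
      map: "function(doc) { /* map function from the message */ }",
      // Built-in reduce: the row counting runs in Erlang instead of JavaScript.
      reduce: "_count"
    }
  }
};

console.log(JSON.stringify(designDoc, null, 2));
```

In the stored JSON the reduce value is simply the string _count in place of the JavaScript function source. Querying the view with group=true should then return one count per distinct phrase.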

> Then a list function filters out unrepeated phrases:
> 
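> function(head, req) {
>   var phrase, first = true;
>   send('{"rows":[\n');
>   while ((phrase = getRow())) {
>     if (phrase.value > 1) { // is repeated
>       // Write the separator before every row but the first,
>       // so the output has no trailing comma and stays valid JSON.
>       send((first ? '' : ',\n') + JSON.stringify(phrase));
>       first = false;
>     }
>   }
>   send('\n]}');
> }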
> 
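The filtering step can be tried outside CouchDB with a small harness, sketched below. The `getRow` and `send` functions here are local stand-ins for the ones CouchDB provides to list functions, and the sample rows and the name `listRepeated` are invented for illustration. This version writes the comma before each row after the first, so the result parses as JSON.

```javascript
// Invented sample of reduced view rows (key = phrase, value = count).
var rows = [
  { key: ["the", "quick", "brown"], value: 3 },
  { key: ["quick", "brown", "fox"], value: 1 },
  { key: ["brown", "fox", "jumps"], value: 2 }
];
var output = "";

// Local stand-ins for CouchDB's getRow() and send().
function getRow() { return rows.shift(); }
function send(chunk) { output += chunk; }

// Same filter as the list function in the message; the separator is sent
// before every row except the first, so no trailing comma is produced.
function listRepeated(head, req) {
  var phrase, first = true;
  send('{"rows":[\n');
  while ((phrase = getRow())) {
    if (phrase.value > 1) { // is repeated
      send((first ? '' : ',\n') + JSON.stringify(phrase));
      first = false;
    }
  }
  send('\n]}');
}

listRepeated(null, null);
var parsed = JSON.parse(output);
console.log(parsed.rows.length); // 2
```

Note that the harness still visits every row, which is the cost discussed earlier in the thread: the filter reduces what is sent to the client, not what the list function has to iterate over.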
> 
> I know that the view could be done differently and probably more efficiently with regular
> expressions, but my worry is not the performance of the initial view generation (that
> was what I meant by "cached"), but the cost paid every time I query the list.
> 
> 
> Regards,
> 
> Aurélien

