incubator-couchdb-user mailing list archives

From Sebastian Cohnen <sebastiancoh...@googlemail.com>
Subject Re: Question on selecting on reduce values
Date Sat, 29 May 2010 06:32:49 GMT
just wanted to add a pointer to the built-in reduce functions: http://wiki.apache.org/couchdb/Built-In_Reduce_Functions

:)
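
For reference, a minimal sketch of where a built-in reduce goes in a design document. The design document name `phrases` and the view name `by_trigram` are made up for illustration and do not come from the thread; the map source is abbreviated to a placeholder.

```javascript
// Hypothetical design document: the names "phrases" and "by_trigram"
// are illustrative assumptions, not taken from the thread.
const designDoc = {
  _id: "_design/phrases",
  views: {
    by_trigram: {
      // Map function stored as source text (placeholder body here).
      map: "function(doc) { /* emit([w1, w2, w3], null) as in the view quoted below */ }",
      // Built-in reduce: a plain string instead of JavaScript source.
      // "_count" counts the rows for each key on the Erlang side.
      reduce: "_count"
    }
  }
};

// Print the JSON that would be saved as the design document.
console.log(JSON.stringify(designDoc, null, 2));
```

The only change from a custom reduce is that the `reduce` field holds the string `_count` rather than a function body.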

On 28.05.2010, at 23:57, J Chris Anderson wrote:

> 
> On May 28, 2010, at 10:02 AM, Aurélien Bénel wrote:
> 
>> Thanks for your answer,
>> 
>>> It seems that you're using a _list function to filter your view results, right?
>>> 
>>> Be aware that even though you're not sending that data to the client, the database still has to iterate through all the view rows and send them to the _list function, just to get filtered there. So the time it takes to query your view/list will increase in proportion to the number of rows returned from the view query.
>> 
>> Yes. This is indeed why I am sceptical about this way of selecting reduce values.
>> 
>> In our project, we are trying to move our open-source text analysis software from PHP/PostgreSQL to CouchDB.
>> The current issue is about getting repeated phrases (sequences of 3 words) in forums.
>> 
>> Each forum thread is stored as a CouchDB "document".
>> 
>> A view emits every sequence that matches several constraints:
>> 
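>> function(doc) {
>> // Tokens alternate between runs of word characters and runs of separators.
>> const ALPHA = /[a-zàâçéêèëïîôöüùû0-9]+|[^a-zàâçéêèëïîôöüùû0-9]+/gi;
>> for (var k in doc.posts) {
>>  var p = doc.posts[k];
>>  var words = p.text.match(ALPHA) || []; // match() returns null when nothing matches
>>  for (var i=0; i<words.length-4; i+=2) {
>>    if (
>>      // at least one of the three words is longer than 3 characters
>>      (words[i].length>3 || words[i+2].length>3 || words[i+4].length>3)
>>      // and the words are separated by a single character
>>      && words[i+1].length==1
>>      && words[i+3].length==1
>>    ) {
>>      emit([
>>        words[i].toLowerCase(),
>>        words[i+2].toLowerCase(),
>>        words[i+4].toLowerCase()
>>      ], null);
>>    }
>>  }
>> }
>> }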
>> 
>> Then a reduce counts occurrences over the whole corpus:
>> 
>> function(keys, values, combine) {
>> if (combine) {
>>  return sum(values);
>> } else {
>>  return values.length;
>> }
>> }
>> 
> 
> try replacing the reduce function with the single-word string "_count" (without the quotes)
> 
> this will do it in Erlang, and should speed things up a lot. please let us know what kind of difference this makes.
> 
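>> Then a list function filters out unrepeated phrases:
>> 
>> function(head, req) {
>> var phrase;
>> var first = true;
>> send('{"rows":[\n');
>> while (phrase = getRow()) {
>>  if (phrase.value>1) { // is repeated
>>      // separator goes before every row except the first, so the JSON stays valid
>>      if (!first) send(',\n');
>>      send(JSON.stringify(phrase));
>>      first = false;
>>  }
>> }
>> send('\n]}');
>> }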
>> 
>> 
>> I know that the view could be done differently and probably more efficiently with regular expressions, but my worry is not the performance of the first generation of the view (that was what I meant by "cached"), but the cost paid every time I query the list.
>> 
>> 
>> Regards,
>> 
>> Aurélien
> 
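
For anyone who wants to try the pipeline quoted above outside CouchDB, here is a rough plain-JavaScript simulation of the three steps (map, grouped count, list filter). It is only a sketch: it does not call CouchDB, the helper names `trigrams`, `countByKey` and `repeated` are invented for this example, and the sample posts are made up. It also shows why the list step costs time on every query: the filter has to look at every grouped row.

```javascript
// Plain-JavaScript simulation; nothing here talks to CouchDB.
const TOKEN = /[a-zàâçéêèëïîôöüùû0-9]+|[^a-zàâçéêèëïîôöüùû0-9]+/gi;

// Map step: return the three-word keys the view would emit for one post.
function trigrams(text) {
  const words = text.match(TOKEN) || [];
  const keys = [];
  for (let i = 0; i < words.length - 4; i += 2) {
    const longEnough =
      words[i].length > 3 || words[i + 2].length > 3 || words[i + 4].length > 3;
    if (longEnough && words[i + 1].length === 1 && words[i + 3].length === 1) {
      keys.push([words[i], words[i + 2], words[i + 4]].map(w => w.toLowerCase()));
    }
  }
  return keys;
}

// Reduce step with grouping: count occurrences of each distinct key.
function countByKey(keys) {
  const counts = new Map();
  for (const key of keys) {
    const id = JSON.stringify(key);
    counts.set(id, (counts.get(id) || 0) + 1);
  }
  return [...counts].map(([id, value]) => ({ key: JSON.parse(id), value }));
}

// List step: visit every row and keep only phrases seen more than once.
function repeated(rows) {
  return rows.filter(row => row.value > 1);
}

// Made-up sample posts, for illustration only.
const posts = ["the quick brown fox", "a quick brown fox again", "the quick brown dog"];
const rows = countByKey(posts.flatMap(trigrams));
console.log(JSON.stringify({ rows: repeated(rows) }));
```

Unlike a real view, this simulation keeps rows in insertion order rather than CouchDB's key collation order.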

