From: Aurélien Bénel
To: user@couchdb.apache.org
Subject: Re: Question on selecting on reduce values
Date: Fri, 28 May 2010 19:02:30 +0200

Thanks for your answer,

> It seems that you're using a _list function to filter your view results, right?
> Be aware that even though you're not sending that data to the client, the database still has to iterate thru all the view rows and send them to the _list function, just to get filtered there. So the amount of time it takes to query your view/list will increase proportionally with the number rows returned from the view query.

Yes. This is indeed why I am sceptical about this way of selecting reduce values.

In our project, we are trying to move our open-source text analysis software from PHP/PostgreSQL to CouchDB.
The current issue is about getting repeated phrases (sequences of 3 words) in forums.

Each forum thread is stored as a CouchDB "document".
A view emits every sequence that matches several constraints:

    function(doc) {
      // Alternating runs of word characters and non-word characters
      const ALPHA = /[a-zàâçéêèëïîôöüùû0-9]+|[^a-zàâçéêèëïîôöüùû0-9]+/gi;
      for each (var p in doc.posts) {
        var words = p.text.match(ALPHA);
        // Loop header truncated in the archive; bound and step are assumed
        for (var i = 0; i + 4 < words.length; i++) {
          if (
            (words[i].length>3 || words[i+2].length>3 || words[i+4].length>3)
            && words[i+1].length==1 && words[i+3].length==1
          ) {
            emit([
              words[i].toLowerCase(),
              words[i+2].toLowerCase(),
              words[i+4].toLowerCase()
            ], null);
          }
        }
      }
    }

A reduce then counts occurrences across the whole corpus:

    function(keys, values, combine) {
      if (combine) {
        // rereduce: add up the partial counts
        return sum(values);
      } else {
        return values.length;
      }
    }

Finally, a list filters out unrepeated phrases:

    function(head, req) {
      var phrase;
      var first = true;
      send('{"rows":[\n');
      while (phrase = getRow()) {
        if (phrase.value>1) { // is repeated
          // comma before every row but the first, so the output stays valid JSON
          if (!first) send(',\n');
          send(JSON.stringify(phrase));
          first = false;
        }
      }
      send('\n]}');
    }

I know that the view could be written differently, and probably more efficiently, with regular expressions. However, my worry is not the cost of generating the view the first time (that is what I meant by "cached"), but the cost paid every time I query the list.

Regards,

Aurélien
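
To make the per-query cost concrete, the three stages can be simulated outside CouchDB in plain JavaScript. This is only a sketch under stated assumptions: nothing here calls CouchDB, the helper names (`mapPhrases`, `groupCount`, `filterRepeated`) and the sample posts are hypothetical, the token pattern is reduced to ASCII for brevity, the loop step is assumed, and the reduce is assumed to be queried with grouping so that each distinct phrase gives one row.

```javascript
// Alternating runs of word and non-word characters (ASCII only in this sketch).
const TOKEN = /[a-z0-9]+|[^a-z0-9]+/gi;

// Map stage: collect every 3-word sequence joined by single-character separators,
// where at least one of the three words is longer than 3 characters.
function mapPhrases(doc) {
  const keys = [];
  for (const post of doc.posts) {
    const words = post.text.match(TOKEN) || [];
    for (let i = 0; i + 4 < words.length; i++) {
      const longEnough =
        words[i].length > 3 || words[i + 2].length > 3 || words[i + 4].length > 3;
      if (longEnough && words[i + 1].length === 1 && words[i + 3].length === 1) {
        keys.push([words[i], words[i + 2], words[i + 4]].map(w => w.toLowerCase()));
      }
    }
  }
  return keys;
}

// Reduce stage (grouped): one row per distinct key, value = number of occurrences.
function groupCount(keys) {
  const counts = new Map();
  for (const key of keys) {
    const id = JSON.stringify(key);
    counts.set(id, (counts.get(id) || 0) + 1);
  }
  return [...counts].map(([id, value]) => ({ key: JSON.parse(id), value }));
}

// List stage: every reduced row is visited, even though only repeated ones are kept.
function filterRepeated(rows) {
  let scanned = 0;
  const kept = [];
  for (const row of rows) {
    scanned++;
    if (row.value > 1) kept.push(row);
  }
  return { scanned, kept };
}

// Hypothetical sample thread with two posts.
const docs = [{
  posts: [
    { text: "the quick brown fox. the quick brown fox" },
    { text: "a very different sentence" }
  ]
}];

const rows = groupCount(docs.flatMap(mapPhrases));
const result = filterRepeated(rows);
console.log("rows scanned:", result.scanned, "rows sent:", result.kept.length);
```

The `scanned` counter grows with the number of distinct phrases in the corpus, not with the number of repeated ones, which is the cost the list function pays on every query.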