From: Søren Hilmer
Organization: wideTrail
To: couchdb-dev@incubator.apache.org
Subject: Re: The state of the fulltext search
Date: Mon, 12 May 2008 23:10:03 +0200
Message-Id: <200805122310.03930.sh@widetrail.dk>

Hi

As you know, I did some work on the idea of indexing views, but as Jan
explains, there was no good way of doing it. I do not believe that the
experiments have significant value if the goal now is to move to JSearch
and do Document indexing instead of Views indexing.

I did post a few patches for couchdb4j, which have made it into trunk
(couchdb4j trunk, that is) and thus make it easier to get the existing
codebase up and running. Couchdb4j still has some issues with handling
Views correctly (it seems to be coded around an old couchdb model, where
Views are tied to Documents).

From Jun's intro to JSearch, it seems to be a nice fit for couchdb. I am
looking forward to trying it out.

Have fun
Søren

On Saturday 10 May 2008 20:56, Jan Lehnardt wrote:
> Heya folks,
> this mail is an introduction for Jun and Bo from IBM who
> would like to contribute JSearch[1] to CouchDB. JSearch
> sits on top of Lucene so this clearly affects our fulltext
> search. All cheers to Jun and Bo I say! :-)
>
> I'll summarise what the current state is and what is planned to
> give a basis for discussion of how things could be integrated.
>
> Fulltext search separates indexing and searching.
>
> Indexing works like this: In couch.ini you specify a standalone
> daemon with the DbUpdateNotificationProcess setting.
> This daemon gets launched by CouchDB when it starts up. The
> daemon is supposed to listen on stdin for notifications
> from CouchDB.
>
> Each time a database in CouchDB is changed, CouchDB sends
> a JSON object over stdio to the notification daemon:
> {"type": "updated", "db":"database_name"}\n
> CouchDB expects no answer. The indexer can then do whatever
> it wants, for example polling CouchDB for the latest changes and
> saving them into a fulltext index. The JSON structure might be
> expanded in the future, but in a backwards compatible
> manner (after 1.0, before 1.0 we might break everything :-).
>
> On this end, I think it would be nice to have a set of scripts that
> make it easy to register for events in all major languages so that
> people don't have to reimplement the listening and polling parts
> and can concentrate on what they actually want to accomplish, but no
> design or work has gone in this direction.
>
>
> Searching works very similarly in that a daemon listens on stdin
> for commands from CouchDB. The protocol is a little more complex
> here because it requires two-way communication.
> CouchDB exposes the search part over the HTTP API. At the
> moment you can call http://server:5984/database/_search?q="searchstring"
> and CouchDB will send this to the searcher daemon:
> database\n
> searchstring\n
> \n
> The searcher is expected to answer either with:
> error\n
> reason\n
> \n
>
> or
>
> ok\n
> docid\n
> score\n
> docid\n
> score\n
> .
> .
> .
> \n
>
> And CouchDB takes this list and returns it wrapped in JSON back to the
> caller.
>
> This is the state but I'd like to see some changes:
>
> I think we should move here from plaintext to JSON as well to gain a bit
> more flexibility. The basic idea is that this mechanism is good for any
> kind of indexing, not just fulltext. A friend of mine is already working
> on geo-searching with this interface[2].
> (In this light, I propose dropping the "fulltext" or "ft" label from
> the source for clarification.)
>
> So we could handle calls like
> http://server:5984/database/_search?q="query"&some_custom_arg=value&other_arg=othervalue
> and pass them to the searcher API as:
> {"db":"database", "args":[{"q":"query"}, {"some_custom_arg":"value"},
> {"other_arg":"othervalue"}]}\n
> \n
> and expect back a JSON result as well, either in single chunks or as one
> huge object:
>
> Chunks:
> {"ok":"true"}\n (or {"error":"reason"}\n\n)
> {"id":"docid", "score":"score"}\n
> {"id":"docid", "score":"score"}\n
> {"id":"docid", "score":"score"}\n
> ...
> \n
>
> Huge:
> {"ok":"true", "result": [
> {"id":"docid", "score":"score"},
> {"id":"docid", "score":"score"},
> {"id":"docid", "score":"score"}
> ]}\n
> \n
>
> This would allow us to enable searchers to add custom values to the
> results and have CouchDB just add them transparently to the result set
> (like with the transparent handling of additional HTTP query arguments).
>
> All of those changes are just to explain the direction I wish to see
> this go in, not very well thought-out proposals. I really appreciate
> your feedback and input here.
>
> I think we do have a halfway working indexer and searcher written for
> Java Lucene. I wrote some code for that a year ago and somebody (please
> step up!) improved that to work on the current CouchDB. But this
> certainly could use some work and any contributions here are very
> welcome (read: I don't want to do it).
>
> One more future direction that was discussed inconclusively before was
> the fulltext indexing of views. The general consensus was that we want
> to have it, but haven't figured out a good way to actually implement it.
> The mailing list archives have some valuable posts on that.
>
> So this is the current state. Now it's your turn :-) How would
> JSearch fit into all this?
> I'm happy to help with any integration questions and suggestions for
> improvements on the CouchDB side, but I'd prefer not to have to deal
> with the Java side of things.
>
> Oh, and one more point Noah Slater brought up in IRC: Adding Java as a
> default requirement to CouchDB is quite heavy. And we need to discuss
> how this is supposed to be packaged and distributed with CouchDB.
>
> Cheers
> Jan
> --
>
> [1] I could swear there was a website but I can't find it anymore.
> So Jun and Bo, could you introduce JSearch to the others here?
>
> [2] http://vmx.cx/cgi-bin/blog/index.cgi

-- 
Søren Hilmer, M.Sc., M.Crypt.
wideTrail                 Phone: +45 25481225
Pilevænget 41             Email: sh@widetrail.dk
DK-8961 Allingåbro        Web:   www.widetrail.dk
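
For illustration, the two stdin protocols described in Jan's quoted
message could be handled roughly as in the Python sketch below. The
function names are hypothetical and not part of CouchDB or JSearch; the
sketch only assumes the newline-terminated message shapes shown in the
mail (a JSON update notification, and the plaintext search request and
replies) and does no actual indexing or searching.

```python
import io
import json


def parse_notification(line):
    """Return the database name from an update notification, else None.

    Assumes one JSON object per line, shaped like
    {"type": "updated", "db": "database_name"} as quoted in the mail.
    """
    message = json.loads(line)
    if message.get("type") == "updated":
        return message.get("db")
    return None


def read_search_request(stream):
    """Read a plaintext search request: database, search string, blank line."""
    database = stream.readline().rstrip("\n")
    query = stream.readline().rstrip("\n")
    stream.readline()  # consume the blank line that ends the request
    return database, query


def format_search_reply(results):
    """Format (docid, score) pairs as the plaintext 'ok' reply."""
    lines = ["ok"]
    for docid, score in results:
        lines.append(str(docid))
        lines.append(str(score))
    # Each entry ends with a newline; an empty line closes the reply.
    return "\n".join(lines) + "\n\n"


def format_search_error(reason):
    """Format the plaintext 'error' reply."""
    return "error\n" + str(reason) + "\n\n"
```

A real daemon would call these in a loop over sys.stdin and write the
replies to sys.stdout, flushing after each one; for a quick check,
read_search_request can be fed an io.StringIO holding a sample request.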