Message-ID: <214c385b0805101306s4b67e7f0mc858f395dd90ba62@mail.gmail.com>
Date: Sat, 10 May 2008 21:06:06 +0100
From: "Matt Goodall"
To: couchdb-dev@incubator.apache.org
Subject: Re: The state of the fulltext search

2008/5/10 Jan Lehnardt:
> Heya folks,
> this mail is an introduction for Jun and Bo from IBM who
> would like to contribute JSearch[1] to CouchDB. JSearch
> sits on top of Lucene so this clearly affects our fulltext
> search.

All cheers to Jun and Bo I say! :-)

>
> I'll summarise what the current state is and what is planned to
> give a basis for discussion of how things could be integrated.
>
> Fulltext search separates indexing and searching.
>
> Indexing works like this: In couch.ini you specify a standalone
> daemon with the DbUpdateNotificationProcess setting. This
> daemon gets launched by CouchDB when it starts up. The
> daemon is supposed to listen on stdin for notifications
> from CouchDB.
>
> Each time a database in CouchDB is changed, CouchDB sends
> a JSON object over stdio to the notification daemon:
> {"type": "updated", "db":"database_name"}\n
> CouchDB expects no answer. The indexer can then do whatever
> it wants, for example polling CouchDB for the latest changes and
> saving them into a fulltext index.
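
As an illustration of the notification protocol described above, here is a minimal Python sketch of the listening side. It assumes one JSON object per line on the daemon's stdin, as in the example message. The function names (read_notifications, databases_to_reindex) are made up for this sketch and are not part of CouchDB, and the actual polling and indexing step is left out.

```python
import io
import json


def read_notifications(stream):
    """Yield one decoded notification per non-empty input line."""
    for line in stream:
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except ValueError:
            # Skip lines that are not valid JSON instead of crashing the daemon.
            continue


def databases_to_reindex(stream):
    """Yield the name of each database reported as updated."""
    for note in read_notifications(stream):
        # Only "updated" notifications are acted on; other types are ignored.
        if isinstance(note, dict) and note.get("type") == "updated" and "db" in note:
            yield note["db"]


# Sample input standing in for what CouchDB would write to the daemon's stdin.
sample = io.StringIO(
    '{"type": "updated", "db":"database_name"}\n'
    '{"type": "created", "db":"other_db"}\n'
    'not json\n'
)
print(list(databases_to_reindex(sample)))
```

A real daemon would pass sys.stdin instead of the StringIO sample and hand each database name to whatever does the polling and indexing.
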
> The JSON structure might be
> expanded in the future, but in a backwards compatible
> manner (after 1.0, before 1.0 we might break everything :-).

It also, unless it's changed recently, sends "type" values of "created"
and "deleted" when a database is created and deleted respectively.
However, I don't really think they're that useful.

>
> On this end, I think it would be nice to have a set of scripts that
> make it easy to register for events in all major languages so that
> people don't have to reimplement the listening and polling parts
> and concentrate on what they actually want to accomplish, but no
> design or work went into this direction.

Hmm, I had a similar thought. I started writing a
DbUpdateNotificationProcess to connect CouchDB to a hyperestraier
search index. Similar to Jan's "set of scripts" idea, it occurred to
me that the bit that talks to hyperestraier could easily be replaced
to connect to some other indexing engine, via a port if necessary.

>
>
> Searching works very similarly in that a daemon listens on stdin
> for commands from CouchDB. The protocol is a little more complex
> here because it requires two-way communication.
> CouchDB exposes the search part over the HTTP API. At the
> moment you can call http://server:5984/database/_search?q="searchstring"
> and CouchDB will send this to the searcher daemon:
> database\n
> searchstring\n
> \n
> The searcher is expected to answer either with:
> error\n
> reason\n
> \n
>
> or
>
> ok\n
> docid\n
> score\n
> docid\n
> score\n
> .
> .
> .
> \n
>
> And CouchDB takes this list and returns it wrapped in JSON back to the
> caller.
>
> This is the state but I'd like to see some changes:
>
> I think we should move here from plaintext to JSON as well to gain a bit
> more flexibility. The basic idea is that this mechanism is good for any
> kind of indexing, not just fulltext. A friend of mine is already working
> on geo-searching with this interface[2].
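
To make the current plaintext exchange concrete, here is a rough Python sketch of the framing a searcher daemon would need: reading one request (database line, search string line, empty line) and writing either the "ok" or the "error" reply. The helper names (read_request, format_ok, format_error) are hypothetical, and the sketch assumes document ids and search strings contain no newlines.

```python
import io


def read_request(stream):
    """Read one request; return (database, query) or None at end of input."""
    db = stream.readline()
    query = stream.readline()
    stream.readline()  # consume the empty line that terminates the request
    if not db or not query:
        return None
    return db.rstrip("\n"), query.rstrip("\n")


def format_ok(hits):
    """Encode (docid, score) pairs as: ok, docid, score, ..., empty line."""
    lines = ["ok"]
    for docid, score in hits:
        lines.append(str(docid))
        lines.append(str(score))
    return "\n".join(lines) + "\n\n"


def format_error(reason):
    """Encode a failure as: error, reason, empty line."""
    return "error\n%s\n\n" % reason


# Sample request standing in for what CouchDB would write to the searcher.
request = read_request(io.StringIO("database\nsearchstring\n\n"))
print(request)
print(repr(format_ok([("doc1", 0.9), ("doc2", 0.5)])))
print(repr(format_error("index not found")))
```

The actual lookup against the index would sit between read_request and format_ok. It is omitted here because it depends entirely on the search engine in use.
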
> (In this light, I propose we drop the "fulltext" or "ft" label from
> the source for clarification).
>
> So we could handle calls like
> http://server:5984/database/_search?q="query"&some_custom_arg=value&other_arg=othervalue
> and pass it to the searcher API as:
> {"db":"database", "args":[{"q":"query"}, {"some_custom_arg":"value"},
> {"other_arg":"other_value"}]}\n
> \n
> and expect back a JSON result as well: either in single chunks or one
> huge object:
>
> Chunks:
> {"ok":"true"}\n (or {"error":"reason"}\n\n)
> {"id":"docid", "score":"score"}\n
> {"id":"docid", "score":"score"}\n
> {"id":"docid", "score":"score"}\n
> ...
> \n
>
> Huge:
> {"ok":"true", "result": [
> {"id":"docid", "score":"score"},
> {"id":"docid", "score":"score"},
> {"id":"docid", "score":"score"}
> ]}\n
> \n
>
> This would allow us to enable searchers to add custom values to the
> results and have CouchDB just add them transparently to the result set
> (like with the transparent handling of additional HTTP query arguments).

JSON definitely sounds like a good idea.

Most (all?) search engines return the matching block of text as part
of the results. It would surely be a good idea to allow applications
to make use of that text if possible instead of hitting CouchDB
multiple times to get the real document.

>
> All of those changes are just to explain the direction I wish to see
> this go in, not very well thought out proposals. I really appreciate
> your feedback and input here.
>
> I think we do have a halfway working indexer and searcher written for
> Java Lucene. I wrote some code for that a year ago and somebody (please
> step up!) improved that to work on the current CouchDB. But this
> certainly could use some work and any contributions here are very
> welcome (read: I don't want to do it).
>
> One more future direction that was discussed inconclusively before was
> the fulltext indexing of views.
> The general consensus was that we want to have it, but haven't figured
> out a good way to actually implement it. The mailing list archives have
> some valuable posts on that.
>
> So this is the current state. Now it's your turn :-) How would JSearch
> fit into all this? I'm happy to help with any integration questions and
> suggestions for improvements on the CouchDB side, but I'd prefer not to
> have to deal with the Java side of things.
>
> Oh, and one more point Noah Slater brought up in IRC: Adding Java as a
> default requirement to CouchDB is quite heavy. And we need to discuss
> how this is supposed to be packaged and distributed with CouchDB.

Yep, I wrote my original Hyperestraier connector in Python (it's what I
use every day) but decided to rewrite in Erlang. Partly to apply my
(somewhat theoretical) knowledge of Erlang to something real, but also
to avoid any dependency on tools (Python, in this case) that might not
be wanted. I do use Java (and Ruby, and Perl, and ...) but definitely
only when I have to.

- Matt