From: Jan Lehnardt
To: couchdb-user@incubator.apache.org
Subject: Re: Sphinx integration (was: Working on Lucene)
Date: Fri, 21 Mar 2008 18:26:26 -0400

On Mar 21, 2008, at 17:55, Chris Anderson wrote:

> On Fri, Mar 21, 2008 at 1:34 PM,
> Jan Lehnardt wrote:
>> Thanks for the input. This is actually an implementation detail of
>> the Indexer, but I agree that this should be supported. I also think
>> we should have some standard way here so other search solutions
>> can be plugged in without breaking things.
>>
>
> Jan,
>
> Some thoughts about Sphinx integration.
>
> The HTTP API as it currently stands (just the ability to page through
> an entire view) is sufficient to implement Sphinx indexing on views as
> an external process.
>
> However, Sphinx has the requirement that the documents it indexes each
> have a unique, numerical id. Using the CouchDB document ID would not
> be advised in that case. Using a map function that emits once per
> document (or using Reduce/Combine when it becomes available) coupled
> with a function to deterministically convert CouchDB document ids into
> integers should make for views which can be easily indexed by Sphinx.
>
> The map function might look like this
>
> function(doc) {
>   if (doc.title) {
>     map(docIDtoInteger(doc._id), doc.title);
>   }
> }
>
> It's too bad that Sphinx doesn't support arbitrary strings as document
> IDs, but I'm sure there are plenty of reversible string-to-integer
> mappings that could be used. In that case Sphinx would be queried and
> return a list of matching integer IDs, which could be mapped back to
> CouchDB document IDs, and then retrieved from the Couch.
>
> This thought experiment is encouraging because it shows that even
> without integration into CouchDB, some very useful custom full-text
> indexes could be created. AFAIK Sphinx's support for updating indexes
> is limited to merging new documents into the index, so it would have
> little use for an API to find view-rows which have been changed or
> removed. Luckily, index rebuild is lightning fast.

This all makes perfect sense to me. We should come up with some
"schema" (heh) that defines how FT Indexers should behave.
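As an aside on the docIDtoInteger() idea in the quoted message, here is
a rough sketch of one way it could work. Everything in it is an
assumption for illustration, not anything CouchDB or Sphinx ships: the
helper hashes the document id with 32-bit FNV-1a (over UTF-16 code
units, so it matches byte-wise FNV-1a only for ASCII ids), and
buildIdTable() is a hypothetical lookup an external indexer could keep
while paging through the view. A hash is deterministic but not
reversible on its own, so the table is what maps Sphinx results back to
CouchDB ids, and it refuses to continue if two ids collide.

```javascript
// Hypothetical helper: deterministic 32-bit FNV-1a hash of a document id.
// Returns an unsigned integer in the range 0 .. 2^32 - 1.
function docIdToInteger(docId) {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < docId.length; i++) {
    hash ^= docId.charCodeAt(i);
    // Multiply by the FNV prime modulo 2^32, then force unsigned.
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Hypothetical reverse lookup, built by the external indexer process.
// Throws if two different document ids hash to the same integer.
function buildIdTable(docIds) {
  const table = new Map();
  for (const docId of docIds) {
    const key = docIdToInteger(docId);
    const existing = table.get(key);
    if (existing !== undefined && existing !== docId) {
      throw new Error(`id collision between "${existing}" and "${docId}"`);
    }
    table.set(key, docId);
  }
  return table;
}

// Usage: map an integer id reported by the search engine back to a doc id.
const table = buildIdTable(["doc-1", "doc-2", "doc-3"]);
const hit = docIdToInteger("doc-2");
console.log(table.get(hit)); // "doc-2"
```

With only 32 bits, collisions become likely for large databases, so a
real setup would probably want a wider hash or a persistent table of
assigned ids instead.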
I am thinking of a special _design document that sets various
configuration variables for the indexer. E.g. the views to use
for indexing:

{
  "_id": "_design/fulltextsearch",
  "_rev": "123",
  "_fulltext_options": {
    "views": ["names", "cities"]
  }
}

where names and cities are the names of two views. The Indexer could
then maintain two separate fulltext indexes based on these views.

The HTTP API for querying could look like this:

http://server/database/_fulltext/names?query="+Me?er -Meyer"

This is not meant as a definitive RFC, but a starting point for
discussions. Please chime in :)

Cheers
Jan
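
PS: Purely to illustrate the proposed _fulltext URL (which does not
exist yet), here is a sketch of how a client might build it. The query
needs percent-encoding on the wire: a raw "+" in a query string is
commonly decoded as a space, and the quotes, "?" and the space are not
safe to send unencoded either. The function name and its parameters are
made up for this example.

```javascript
// Hypothetical client helper for the *proposed* _fulltext endpoint.
// server:   base URL without a trailing slash, e.g. "http://server"
// database: database name
// view:     name of a view listed in _fulltext_options.views
// query:    raw full-text query string
function fulltextQueryUrl(server, database, view, query) {
  // Encode each path segment separately so "/" inside a name is escaped.
  const path = [database, "_fulltext", view]
    .map((segment) => encodeURIComponent(segment))
    .join("/");
  return `${server}/${path}?query=${encodeURIComponent(query)}`;
}

const url = fulltextQueryUrl("http://server", "database", "names", '"+Me?er -Meyer"');
console.log(url);
// http://server/database/_fulltext/names?query=%22%2BMe%3Fer%20-Meyer%22
```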