Mailing-List: contact couchdb-user-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: couchdb-user@incubator.apache.org
Received-SPF: pass (athena.apache.org: domain of jan@prima.de designates
 83.97.50.139 as permitted sender)
Message-Id: <D43C3E2E-6630-403D-BC4B-85DF2AC6B48A@prima.de>
From: Jan Lehnardt <jan@prima.de>
To: couchdb-user@incubator.apache.org
In-Reply-To: <e282921e0803211455s2ae067fet5aecc94b0561b144@mail.gmail.com>
Content-Type: text/plain; charset=US-ASCII; format=flowed
Content-Transfer-Encoding: 7bit
Mime-Version: 1.0 (Apple Message framework v919.2)
Subject: Re: Sphinx integration (was: Working on Lucene)
Date: Fri, 21 Mar 2008 18:26:26 -0400
References: <e282921e0803211455s2ae067fet5aecc94b0561b144@mail.gmail.com>


On Mar 21, 2008, at 17:55 , Chris Anderson wrote:
> On Fri, Mar 21, 2008 at 1:34 PM, Jan Lehnardt <jan@apache.org> wrote:
>> Thanks for the input. This is actually an implementation detail of
>> the Indexer, but I agree that this should be supported. I also think
>> we should have some standard way here so other search solutions
>> can be plugged in without breaking things.
>>
>
> Jan,
>
> Some thoughts about Sphinx integration.
>
> The HTTP API as it currently stands (just the ability to page through
> an entire view) is sufficient to implement Sphinx indexing on views as
> an external process.
>
> However, Sphinx has the requirement that the documents it indexes each
> have a unique, numerical id. Using the CouchDB document ID would not
> be advised in that case. Using a map function that emits once per
> document (or using Reduce/Combine when it becomes available) coupled
> with a function to deterministically convert CouchDB document ids into
> integers should make for views which can be easily indexed by Sphinx.
>
> The map function might look like this
>
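> function(doc) {
>  // Only index documents that have a title.
>  if (doc.title) {
>    // Key: integer derived from the document ID (stored as _id).
>    map(docIDtoInteger(doc._id), doc.title);
>  }
> }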
>
> It's too bad that Sphinx doesn't support arbitrary strings as document
> IDs, but I'm sure there are plenty of reversible string-to-integer
> mappings that could be used. In that case Sphinx would be queried and
> return a list of matching integer IDs, which could be mapped back to
> CouchDB document IDs, and then retrieved from the Couch.
>
> This thought experiment is encouraging because it shows that even
> without integration into CouchDB, some very useful custom full-text
> indexes could be created. AFAIK Sphinx's support for updating indexes
> is limited to merging new documents into the index, so it would have
> little use for an API to find view-rows which have been changed or
> removed. Luckily, index rebuild is lightning fast.

This all makes perfect sense to me.
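
To make the docIDtoInteger idea a bit more concrete, here is a rough
sketch. It assumes a 32-bit FNV-1a hash over ASCII document IDs, which
is my choice for illustration and not something Sphinx or CouchDB
prescribes. A hash is deterministic but not truly reversible, so the
hypothetical buildReverseIndex helper keeps a lookup table and fails
loudly on a collision. Since view rows also carry the ID of the
document they came from, the view itself could probably serve as that
lookup.

```javascript
// Hypothetical helper: deterministic 32-bit FNV-1a hash of a document ID.
// Assumes ASCII IDs; only the low byte of each character is hashed.
function docIDtoInteger(id) {
  var hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (var i = 0; i < id.length; i++) {
    hash ^= id.charCodeAt(i) & 0xff;
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime, mod 2^32
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Hypothetical helper: integer -> document ID table for mapping
// search results back to documents. Throws if two IDs collide.
function buildReverseIndex(docIds) {
  var reverse = new Map();
  docIds.forEach(function (id) {
    var key = docIDtoInteger(id);
    if (reverse.has(key) && reverse.get(key) !== id) {
      throw new Error("hash collision: " + reverse.get(key) + " vs " + id);
    }
    reverse.set(key, id);
  });
  return reverse;
}
```

With a 32-bit key, collisions become likely well before a database
gets huge, so a wider integer or a stored counter per document may be
the better choice in practice.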

We should come up with some "schema" (heh) that defines how
FT Indexers should behave. I am thinking of a special _design
document that sets various configuration variables for the indexer.

E.g. the views to use for indexing:

{
   "_id":"_design/fulltextsearch",
   "_rev":"123",
   "_fulltext_options": {
     "views": ["names", "cities"]
   }
}

where names and cities are the names of two views. The Indexer
could then maintain two separate fulltext indexes based on these
views. The HTTP API for querying could look like this:

http://server/database/_fulltext/names?query="+Me?er -Meyer"
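
A client would need to URL-encode the query string, since characters
like +, ? and the space have special meaning in a URL. The sketch
below shows how an indexer might read the view list from the proposed
design document and how a client might build the query URL. The
function names indexedViews and fulltextUrl are made up for
illustration, and the _fulltext endpoint is only the proposal above,
not an existing API.

```javascript
// Hypothetical: list the views named in the proposed _fulltext_options.
// Returns an empty array if the option is missing or malformed.
function indexedViews(designDoc) {
  var options = designDoc._fulltext_options || {};
  return Array.isArray(options.views) ? options.views : [];
}

// Hypothetical: build a query URL for the proposed _fulltext endpoint,
// percent-encoding each variable part.
function fulltextUrl(base, database, view, query) {
  return base + "/" + encodeURIComponent(database) +
    "/_fulltext/" + encodeURIComponent(view) +
    "?query=" + encodeURIComponent(query);
}

var designDoc = {
  "_id": "_design/fulltextsearch",
  "_rev": "123",
  "_fulltext_options": { "views": ["names", "cities"] }
};

var views = indexedViews(designDoc); // ["names", "cities"]
var url = fulltextUrl("http://server", "database", views[0], "+Me?er -Meyer");
// url is "http://server/database/_fulltext/names?query=%2BMe%3Fer%20-Meyer"
```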

This is not meant as a definitive RFC, but a starting point for
discussions. Please chime in :)

Cheers
Jan
--