incubator-couchdb-user mailing list archives

From Jan Lehnardt <...@apache.org>
Subject Re: Sphinx integration (was: Working on Lucene)
Date Fri, 21 Mar 2008 22:31:08 GMT
Sorry for the dupe and please ignore the typos :)

Cheers
Jan
--
On Mar 21, 2008, at 18:26 , Jan Lehnardt wrote:
>
> On Mar 21, 2008, at 17:55 , Chris Anderson wrote:
>> On Fri, Mar 21, 2008 at 1:34 PM, Jan Lehnardt <jan@apache.org> wrote:
>>> Thanks for the input. This is actually an implementation detail of
>>> the Indexer, but I agree that this should be supported. I also think
>>> we should have some standard way here so other search solutions
>>> can be plugged in without breaking things.
>>>
>>
>> Jan,
>>
>> Some thoughts about Sphinx integration.
>>
>> The HTTP API as it currently stands (just the ability to page through
>> an entire view) is sufficient to implement Sphinx indexing on views as
>> an external process.
>>
>> However, Sphinx has the requirement that the documents it indexes each
>> have a unique, numerical id. Using the CouchDB document ID directly
>> would not be advised in that case. Using a map function that emits once
>> per document (or using Reduce/Combine when it becomes available) coupled
>> with a function to deterministically convert CouchDB document ids into
>> integers should make for views which can be easily indexed by Sphinx.
>>
>> The map function might look like this:
>>
>> function(doc) {
>>   // docIDtoInteger is a helper that would still have to be written
>>   if (doc.title) {
>>     map(docIDtoInteger(doc._id), doc.title);
>>   }
>> }
>>
>> It's too bad that Sphinx doesn't support arbitrary strings as document
>> IDs, but I'm sure there are plenty of reversible string-to-integer
>> mappings that could be used. In that case Sphinx would be queried and
>> return a list of matching integer IDs, which could be mapped back to
>> CouchDB document IDs, and then retrieved from the Couch.
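
One way the string-to-integer mapping described above could be sketched, under some assumptions: the integer IDs are taken to be 32-bit unsigned, the hash is an FNV-1a variant over UTF-16 code units (my choice, not something proposed in the thread), and the mapping is made reversible with a lookup table built by the external indexer from the `id` field of each view row. `buildReverseIndex` is a hypothetical helper name, and `docIDtoInteger` is only one possible implementation of the helper named in the map function.

```javascript
// Deterministic string -> 32-bit unsigned integer (FNV-1a variant that
// mixes UTF-16 code units instead of bytes). A hash is not reversible
// on its own, so the indexer keeps a lookup table as well.
function docIDtoInteger(docId) {
  var hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (var i = 0; i < docId.length; i++) {
    hash ^= docId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return hash >>> 0; // force an unsigned result
}

// Build integer -> document ID from the rows of a view response.
// Assumes each row carries the source document ID in `row.id`.
// Throws on a collision rather than silently mapping two documents
// to the same integer.
function buildReverseIndex(rows) {
  var byInteger = new Map();
  rows.forEach(function (row) {
    var n = docIDtoInteger(row.id);
    if (byInteger.has(n) && byInteger.get(n) !== row.id) {
      throw new Error('hash collision: ' + row.id + ' vs ' + byInteger.get(n));
    }
    byInteger.set(n, row.id);
  });
  return byInteger;
}
```

With a table like this, the integers returned by a search can be looked up with `byInteger.get(n)` to recover the document IDs. A 32-bit hash will collide once a database grows large enough, so a persistent counter-based mapping may be the better choice in practice.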
>>
>> This thought experiment is encouraging because it shows that even
>> without integration into CouchDB, some very useful custom full-text
>> indexes could be created. AFAIK Sphinx's support for updating indexes
>> is limited to merging new documents into the index, so it would have
>> little use for an API to find view-rows which have been changed or
>> removed. Luckily, index rebuild is lightning fast.
>
> This all makes perfect sense to me.
>
> We should come up with some "schema" (heh) that defines how
> FT Indexers should behave. I am thinking of a special _design
> document that sets various configuration variables for the indexer.
>
> E.g. the views to use for indexing:
>
> {
>   "_id": "_design/fulltextsearch",
>   "_rev": "123",
>   "_fulltext_options": {
>     "views": ["names", "cities"]
>   }
> }
>
> where names and cities were the names of two views. The Indexer
> then could maintain two separate fulltext indexes based on these
> views. The HTTP API for querying could look like this:
>
> http://server/database/_fulltext/names?query="+Me?er -Meyer"
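
The query string in the URL above contains characters (`"`, `+`, `?` and a space) that have special meaning in URLs, so a client would most likely need to percent-encode it. A minimal sketch of building such a URL, assuming the proposed `_fulltext` endpoint and `query` parameter exist as described (neither is implemented; `fulltextQueryUrl` is a hypothetical helper):

```javascript
// Build a URL for the proposed fulltext endpoint.
// `base` is the server root without a trailing slash.
function fulltextQueryUrl(base, database, view, query) {
  return base + '/' + encodeURIComponent(database) +
    '/_fulltext/' + encodeURIComponent(view) +
    '?query=' + encodeURIComponent(query);
}

var url = fulltextQueryUrl('http://server', 'database', 'names', '"+Me?er -Meyer"');
```

Encoding the `+` matters in particular, because many servers decode a literal `+` in a query string as a space, which would drop the "required term" operator.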
>
> This is not meant as a definitive RFC, but a starting point for
> discussions. Please chime in :)
>
> Cheers
> Jan
> --
>

