incubator-couchdb-user mailing list archives

From Jan Lehnardt <...@apache.org>
Subject Re: Working on Lucene
Date Fri, 21 Mar 2008 20:00:22 GMT
Heya Søren,
thanks for picking this up. Any help is greatly appreciated :)

On Mar 21, 2008, at 11:02 , Søren Hilmer wrote:
> I have hacked a little on the LuceneIndexer, and fixed some bugs to get
> it compiling and running, though I also had to patch couchdb4j (patch
> was from the couchdb4j issues page). Also found that the readme needs
> some tuning, LuceneIndexer in couch.ini is now
> FullTextSearchQueryServer, right?

Nope, that is the LuceneSearcher. The LuceneIndexer is now
DbUpdateNotificationProcess.


> But all is still not well, here are a few questions that I hope someone
> can supply answers to:
>
> 1) couchdb4j uses "_all_docs_by_update_seq" to get a specific revision
> of a document. The trunk version of couchdb does not support this. Has
> it been discontinued in favor of "_all_docs_by_seq" ?

Yes.


> 2) What was actually the intention of the LuceneIndexer, I guess that
> it should traverse all the databases and all the documents within these
> and store the result in the database "couchdbfulltext", right? Some
> work to achieve this seems necessary.

LuceneIndexer is supposed to create the fulltext index that
LuceneSearcher can then query. It is responsible for building and
maintaining that index, that is, updating and deleting entries as
needed. See below.


> 3) When a database has a changed document, the indexer should re-index
> it, right?

Correct. LuceneIndexer is launched along with and from CouchDB if you
supply the ini option I mentioned above. CouchDB opens a stdio
connection with LuceneIndexer, and LuceneIndexer has to listen on stdin
for the messages CouchDB sends. Every time a database changes, CouchDB
sends the database name followed by a newline to LuceneIndexer. CouchDB
expects LuceneIndexer NOT to answer.
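
A minimal sketch of that notification loop, in Python for brevity (the
real LuceneIndexer is a Java program). The function name
listen_for_updates and the on_change callback are made up for
illustration, and skipping blank lines is an assumption, not something
CouchDB specifies:

```python
import io
from typing import Callable, TextIO


def listen_for_updates(stream: TextIO, on_change: Callable[[str], None]) -> int:
    """Read one database name per line from `stream` until EOF.

    Nothing is ever written back: CouchDB does not expect an answer
    to update notifications.
    """
    handled = 0
    for line in stream:
        db_name = line.strip()
        if not db_name:
            continue  # assumption: blank lines carry no notification
        on_change(db_name)
        handled += 1
    return handled
```

In a real process you would pass sys.stdin as the stream, for example
listen_for_updates(sys.stdin, reindex), where reindex is whatever
function updates the index for the named database.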

The first time a change notification is sent, that is, when no index
has been written yet, LuceneIndexer fetches all documents from CouchDB
and integrates their contents into the search index. Along with that,
LuceneIndexer keeps track of the update sequence number of that
database. So on all subsequent notifications, LuceneIndexer can use
that sequence number to ask only for the documents that changed since
the last time, and in turn update the fulltext index accordingly. In
practice you would not fetch each doc individually but batch the work:
query only every N seconds or only once for each M notifications.
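
Roughly, the bookkeeping could look like the sketch below (Python, not
the actual LuceneIndexer code). Everything named here is hypothetical:
fetch_changes(db, since) stands in for whatever call returns the
changes after a given sequence number (for example a request against
_all_docs_by_seq), a plain set of doc ids stands in for the Lucene
index, and batch_size plays the role of M:

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical change record: (doc_id, update_seq, deleted)
Change = Tuple[str, int, bool]


class IncrementalIndexer:
    def __init__(self,
                 fetch_changes: Callable[[str, int], List[Change]],
                 batch_size: int = 1) -> None:
        self.fetch_changes = fetch_changes
        self.batch_size = batch_size          # refresh once per M notifications
        self.last_seq: Dict[str, int] = {}    # db name -> last indexed update seq
        self.pending: Dict[str, int] = {}     # db name -> notifications since refresh
        self.index: Dict[str, Set[str]] = {}  # stand-in for the fulltext index

    def notify(self, db: str) -> bool:
        """Record one notification; refresh only when the batch is full."""
        self.pending[db] = self.pending.get(db, 0) + 1
        if self.pending[db] < self.batch_size:
            return False
        self.pending[db] = 0
        self.refresh(db)
        return True

    def refresh(self, db: str) -> None:
        """Apply every change after the stored sequence number (0 = no index yet)."""
        since = self.last_seq.get(db, 0)
        docs = self.index.setdefault(db, set())
        for doc_id, seq, deleted in self.fetch_changes(db, since):
            if deleted:
                docs.discard(doc_id)   # remove deleted docs from the index
            else:
                docs.add(doc_id)       # add or re-index changed docs
            since = max(since, seq)
        self.last_seq[db] = since
```

With no stored sequence number the first refresh starts from 0 and so
pulls in everything, which matches the initial full build; a
time-based variant (every N seconds) would replace the counter in
notify with a timestamp check.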

Makes sense? :)

> 4) I have still not looked at the LuceneSearcher, how is that hooked
> into couchdb?

In the same way, with the ini option you mentioned in your mail. When
CouchDB gets a fulltext query, it sends the query string over stdio to
the LuceneSearcher along with the database name. LuceneSearcher returns
a list of all documents that match that query, along with their
probabilities. CouchDB then returns this list.
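
To make the shape of that exchange concrete, here is a sketch in
Python. The wire format is NOT described above, so the framing used
here is purely an assumption for illustration: database name on one
line, query string on the next, and the answer as one
"doc_id<TAB>score" line per hit followed by an empty line. The names
serve_queries and search are hypothetical as well:

```python
import io
from typing import Callable, List, TextIO, Tuple

Hit = Tuple[str, float]  # (doc_id, score)


def serve_queries(inp: TextIO, out: TextIO,
                  search: Callable[[str, str], List[Hit]]) -> None:
    """Answer fulltext queries until the input stream is closed.

    Assumed framing (not taken from CouchDB): two input lines per
    query, then one line per hit and a terminating empty line.
    """
    while True:
        db_line = inp.readline()
        if not db_line:
            break                      # EOF: CouchDB closed the pipe
        query_line = inp.readline()
        if not query_line:
            break                      # incomplete request, give up
        for doc_id, score in search(db_line.strip(), query_line.strip()):
            out.write(f"{doc_id}\t{score}\n")
        out.write("\n")                # assumed end-of-result marker
        out.flush()                    # do not leave the reader waiting
```

Unlike the indexer, the searcher has to write back to stdout, so
flushing after every answer matters.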

Note, however, that there is no HTTP API to test this yet; fulltext
queries are only available through the internal API. So you'd have to
start CouchDB with the Erlang console (-i flag IIRC) and use
couchdb_ft_query:execute("database", "+ query +string"). to send
CouchDB fulltext queries.

> I hope to get it working and supply a patch when it does, hopefully I
> am not treading on anyone's toes here.

By no means! Please go ahead. We are grateful for every helping hand. If
you have any more questions, just send them in. You might want to check
out #couchdb on Freenode if you are into IRC.

Thanks for your help.

Cheers,
Jan