couchdb-dev mailing list archives

From Jan Lehnardt <...@apache.org>
Subject The state of the fulltext search
Date Sat, 10 May 2008 18:56:10 GMT
Heya folks,
this mail is an introduction for Jun and Bo from IBM who
would like to contribute JSearch[1] to CouchDB. JSearch
sits on top of Lucene so this clearly affects our fulltext
search. All cheers to Jun and Bo I say! :-)

I'll summarise what the current state is and what is planned to
give a basis for discussion of how things could be integrated.

Fulltext search separates indexing and searching.

Indexing works like this: In couch.ini you specify a standalone
daemon with the DbUpdateNotificationProcess setting. This
daemon gets launched by CouchDB when it starts up. The
daemon is supposed to listen on stdin for notifications
from CouchDB.

Each time a database in CouchDB is changed, CouchDB sends
a JSON object over stdio to the notification daemon:
{"type": "updated", "db":"database_name"}\n
CouchDB expects no answer. The indexer can then do whatever
it wants, for example poll CouchDB for the latest changes and
save them into a fulltext index. The JSON structure might be
expanded in the future, but in a backwards-compatible
manner (after 1.0; before 1.0 we might break everything :-).
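
To make the notification format concrete, here is a minimal Python sketch
of the listening side. It assumes only what is described above: one JSON
object per line on stdin and no reply expected. The names
read_notifications and main are made up for illustration, and the
stderr line is a placeholder for whatever the indexer actually does.

```python
import json
import sys


def read_notifications(stream):
    """Yield the database name for each "updated" notification on stream."""
    for line in stream:
        line = line.strip()
        if not line:
            continue
        try:
            message = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON
        if not isinstance(message, dict):
            continue
        # Unknown keys are ignored so that future additions do not break us.
        if message.get("type") == "updated" and "db" in message:
            yield message["db"]


def main():
    for db in read_notifications(sys.stdin):
        # Placeholder: a real indexer would poll CouchDB for the latest
        # changes to this database and update its index here.
        sys.stderr.write("database changed: %s\n" % db)
```

Calling main() turns this into the daemon loop; it ends when CouchDB
closes the pipe.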

On this end, I think it would be nice to have a set of scripts in all
major languages that make it easy to register for events, so that
people don't have to reimplement the listening and polling parts
and can concentrate on what they actually want to accomplish. No
design or work has gone in this direction yet.


Searching works very similarly, in that a daemon listens on stdin
for commands from CouchDB. The protocol is a little more complex
here because it requires two-way communication.
CouchDB exposes the search part over the HTTP API. At the
moment you can call http://server:5984/database/_search?q="searchstring"
and CouchDB will send this to the searcher daemon:
database\n
searchstring\n
\n
The searcher is expected to answer either with:
error\n
reason\n
\n

or

ok\n
docid\n
score\n
docid\n
score\n
.
.
.
\n

And CouchDB takes this list and returns it wrapped in JSON back to the
caller.
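
A rough Python sketch of a searcher speaking this plaintext protocol
could look like the following. It assumes the framing exactly as listed
above (database line, query line, empty line in; ok or error block out).
The search function is a hypothetical stand-in for a real index lookup,
and all function names are made up for illustration.

```python
import sys


def read_request(stream):
    """Read one request: database line, query line, then an empty line.

    Returns (database, query), or None at end of input.
    """
    database = stream.readline()
    if not database:
        return None
    query = stream.readline()
    stream.readline()  # consume the terminating empty line
    return database.rstrip("\n"), query.rstrip("\n")


def format_response(hits=None, error=None):
    """Build the answer: an error block, or "ok" plus docid/score pairs."""
    if error is not None:
        # Keep the reason on one line so the framing stays intact.
        return "error\n%s\n\n" % str(error).replace("\n", " ")
    lines = ["ok"]
    for docid, score in hits or []:
        lines.append(str(docid))
        lines.append(str(score))
    return "\n".join(lines) + "\n\n"


def search(database, query):
    """Hypothetical stand-in: a real searcher would query its index here."""
    return []


def main():
    while True:
        request = read_request(sys.stdin)
        if request is None:
            break
        database, query = request
        try:
            reply = format_response(hits=search(database, query))
        except Exception as exc:
            reply = format_response(error=exc)
        sys.stdout.write(reply)
        sys.stdout.flush()  # CouchDB is waiting on the other end of the pipe
```

Calling main() runs the request loop. Flushing after every reply matters
because stdout is a pipe and would otherwise be buffered.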

This is the state but I'd like to see some changes:

I think we should move here from plaintext to JSON as well to gain a bit
more flexibility. The basic idea is that this mechanism is good for any
kind of indexing, not just fulltext. A friend of mine is already working
on geo-searching with this interface[2]. (In this light, I propose
dropping the "fulltext" or "ft" label from the source for clarification.)

So we could handle calls like
http://server:5984/database/_search?q="query"&some_custom_arg=value&other_arg=othervalue
and pass it to the searcher API as:
{"db":"database", "args":[{"q":"query"}, {"some_custom_arg":"value"}, {"other_arg":"othervalue"}]}\n
\n
and expect back a JSON result as well: either in single chunks or one
huge object:

Chunks:
{"ok":"true"}\n (or {"error":"reason"}\n\n)
{"id":"docid", "score":"score"}\n
{"id":"docid", "score":"score"}\n
{"id":"docid", "score":"score"}\n
...
\n

Huge:
{"ok":"true", "result": [
{"id":"docid", "score":"score"},
{"id":"docid", "score":"score"},
{"id":"docid", "score":"score"}
]}\n
\n

This would allow us to enable searchers to add custom values to the
results and have CouchDB just add them transparently to the result set
(like with the transparent handling of additional HTTP query arguments).
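
As an illustration of the proposed JSON variant, here is a small Python
sketch of the searcher side using the "Chunks" response form. Nothing
here is implemented in CouchDB; the message shapes are taken from the
proposal above, and the function names as well as the distance_km field
in the usage note are invented for the example.

```python
import json


def parse_request(line):
    """Parse one JSON request line into (db, args).

    The list of single-key objects in "args" is merged into one dict,
    so custom query arguments are passed through untouched.
    """
    request = json.loads(line)
    args = {}
    for pair in request.get("args", []):
        args.update(pair)
    return request["db"], args


def chunked_response(hits):
    """Format hits as a status line, one JSON line per hit, then an empty line."""
    lines = [json.dumps({"ok": "true"})]
    for hit in hits:
        # Any extra keys in a hit are emitted as-is for CouchDB to pass on.
        lines.append(json.dumps(hit))
    return "\n".join(lines) + "\n\n"


def error_response(reason):
    """Format the error form: a single JSON line followed by an empty line."""
    return json.dumps({"error": reason}) + "\n\n"
```

A geo searcher could then return something like
chunked_response([{"id": "docid", "score": "score", "distance_km": "1.2"}])
and the extra field would travel along with the id and score.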

All of those changes are just to explain the direction I wish to see
this go in, not very well thought-out proposals. I really appreciate
your feedback and input here.

I think we do have a halfway working indexer and searcher written for
Java Lucene. I wrote some code for that a year ago and somebody (please
step up!) improved that to work on the current CouchDB. But this
certainly could use some work and any contributions here are very
welcome (read: I don't want to do it).

One more future direction that was discussed inconclusively before was
the fulltext indexing of views. The general consensus was that we want
to have it, but haven't figured out a good way to actually implement it.
The mailing list archives have some valuable posts on that.

So this is the current state. Now it's your turn :-) How would JSearch
fit into all this? I'm happy to help with any integration questions and
suggestions for improvements on the CouchDB side, but I'd prefer not to
have to deal with the Java side of things.

Oh, and one more point Noah Slater brought up in IRC: Adding Java as a
default requirement to CouchDB is quite heavy. And we need to discuss
how this is supposed to be packaged and distributed with CouchDB.

Cheers
Jan
--

[1] I could swear there was a website but I can't find it anymore.
So Jun and Bo, could you introduce JSearch to the others here?

[2] http://vmx.cx/cgi-bin/blog/index.cgi
