couchdb-dev mailing list archives

From Jun Rao <jun...@almaden.ibm.com>
Subject Re: The state of the fulltext search
Date Mon, 12 May 2008 01:16:10 GMT

Jan,

Thanks for the introduction.

JSearch is a prototype that we developed for indexing and searching JSON
documents, and we are enthusiastic about contributing it to CouchDB. JSearch
converts a given JSON document to a Lucene document for indexing. The
conversion is lossless and preserves all structural information in the
original JSON document. We achieve that by storing the encoding of the JSON
structures in the payloads of the posting lists in the Lucene index. JSearch
has a simple query language that combines fulltext search and structural
querying. To qualify as a match, a document has to match both the JSON
structures and the Boolean constraints specified in the query. Suppose
that we have indexed the following two JSON documents:
   d1={ A: [ { B: "b1",  C: "c1" },
             { B: "b2",  C: "c2" }
           ]
      }
   d2={ A: [ { B: "b1",  C: "c2" },
             { B: "b2",  C: "c1" }
           ]
      }
One can issue the following two JSearch queries:
   P={ A: [ { B: "b1" && C: "c1" } ] }
   Q={ A: [ { B: "b1"} && {C: "c1" } ] }
Query P ("&&" specifies conjunction) matches d1, but not d2. The reason is
that no single JSON object in d2 has both B: "b1" and C: "c1".
On the other hand, query Q matches both d1 and d2, since it doesn't require
the B field and the C field to be in the same JSON object.
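
To make the difference between P and Q concrete, here is a minimal Python
sketch that evaluates both queries against plain in-memory dictionaries. It
only illustrates the matching semantics described above; it is not how
JSearch works internally (JSearch evaluates queries against a Lucene index),
and the function names matches_p and matches_q are made up for this example.

```python
# Illustration only: hand-written checks for the two example queries.
# JSearch itself answers such queries from a Lucene index, not like this.

d1 = {"A": [{"B": "b1", "C": "c1"}, {"B": "b2", "C": "c2"}]}
d2 = {"A": [{"B": "b1", "C": "c2"}, {"B": "b2", "C": "c1"}]}


def matches_p(doc):
    """P: a single object in A has both B == "b1" and C == "c1"."""
    return any(
        obj.get("B") == "b1" and obj.get("C") == "c1"
        for obj in doc.get("A", [])
    )


def matches_q(doc):
    """Q: some object in A has B == "b1" and some object (possibly a
    different one) has C == "c1"."""
    objs = doc.get("A", [])
    has_b = any(obj.get("B") == "b1" for obj in objs)
    has_c = any(obj.get("C") == "c1" for obj in objs)
    return has_b and has_c
```

With these definitions, matches_p holds for d1 only, while matches_q holds
for both documents.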

Here is a summary of the querying features in JSearch:
1. arbitrary conjunctive and disjunctive constraints
2. text search on atomic values of string type
3. range constraints on atomic values (only those of string and long types
   are currently supported)
4. document level matching

The easiest way to learn more about JSearch is to give it a try. Download the
attached tgz file (I couldn't include the Lucene library because of the
limit on message size, so you have to download it separately).
Follow the readme file in it and try some of the examples. The attachment
also includes all Java source code (I can provide more technical details if
needed). I am very interested in your feedback. Does JSearch fit into
CouchDB? What other features are needed? How should JSearch be integrated
(from Jan's mail, it seems that some infrastructure is already in place)?
Thanks,

(See attached file: jsearch.tgz)

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA  95120-6099

junrao@almaden.ibm.com
(408)927-1886 (phone)
(408)927-3215 (fax)


Jan Lehnardt <jan@apache.org> wrote on 05/10/2008 11:56:10 AM:

> Heya folks,
> this mail is an introduction for Jun and Bo from IBM who
> would like to contribute JSearch[1] to CouchDB. JSearch
> sits on top of Lucene so this clearly affects our fulltext
> search. All cheers to Jun and Bo I say! :-)
>
> I'll summarise what the current state is and what is planned to
> give a basis for discussion of how things could be integrated.
>
> Fulltext search separates indexing and searching.
>
> Indexing works like this: In couch.ini you specify a standalone
> daemon with the DbUpdateNotificationProcess setting. This
> daemon gets launched by CouchDB when it starts up. The
> daemon is supposed to listen on stdin for notifications
> from CouchDB.
>
> Each time a database in CouchDB is changed, CouchDB sends
> a JSON object to the notification daemon's stdin:
> {"type": "updated", "db":"database_name"}\n
> CouchDB expects no answer. The indexer can then do whatever
> it wants, for example polling CouchDB for the latest changes and
> saving them into a fulltext index. The JSON structure might be
> expanded in the future, but in a backwards compatible
> manner (after 1.0; before 1.0 we might break everything :-).
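
As a rough illustration of the notification mechanism described above, a
listener could look like the following Python sketch. It assumes only what
the mail states: one JSON object per line with "type" and "db" keys. The
names read_notifications, run, and on_update are hypothetical, and the
actual indexing step is left to the callback.

```python
import json


def read_notifications(stream):
    """Yield the database name for every "updated" notification line.

    Blank lines, malformed JSON, and unknown notification types are
    skipped, so that additional fields or types added later do not
    break the listener.
    """
    for line in stream:
        line = line.strip()
        if not line:
            continue
        try:
            message = json.loads(line)
        except ValueError:
            continue  # not valid JSON; ignore the line
        if not isinstance(message, dict):
            continue
        if message.get("type") == "updated" and "db" in message:
            yield message["db"]


def run(stream, on_update):
    """Call on_update(db_name) for each notification; send no reply."""
    for db_name in read_notifications(stream):
        on_update(db_name)
```

A daemon built on this sketch would call run(sys.stdin, reindex), where
reindex is whatever function polls CouchDB for changes and updates the
index.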
>
> On this end, I think it would be nice to have a set of scripts that
> make it easy to register for events in all major languages so that
> people don't have to reimplement the listening and polling parts
> and can concentrate on what they actually want to accomplish, but no
> design or implementation work has gone into this yet.
>
>
> Searching works very similarly in that a daemon listens on stdin
> for commands from CouchDB. The protocol is a little more complex
> here because it requires two-way communication.
> CouchDB exposes the search part over the HTTP API. At the
> moment you can call http://server:5984/database/_search?q="searchstring"
> and CouchDB will send this to the searcher daemon:
> database\n
> searchstring\n
> \n
> The searcher is expected to answer either with:
> error\n
> reason\n
> \n
>
> or
>
> ok\n
> docid\n
> score\n
> docid\n
> score\n
> .
> .
> .
> \n
>
> And CouchDB takes this list and returns it wrapped in JSON back to the
> caller.
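
The plaintext exchange described above can be sketched from the searcher's
side roughly as follows. This is a minimal Python illustration of the
message framing only, assuming the search string contains no newline; the
function names are made up for this example and the actual index lookup is
omitted.

```python
def parse_request(text):
    """Split "database\\nsearchstring\\n\\n" into (database, query).

    Raises ValueError if the two expected lines are not present.
    """
    database, query, _rest = text.split("\n", 2)
    return database, query


def format_ok(hits):
    """Build the success reply from an iterable of (docid, score) pairs:
    "ok", then alternating docid and score lines, then a blank line."""
    lines = ["ok"]
    for docid, score in hits:
        lines.append(str(docid))
        lines.append(str(score))
    return "\n".join(lines) + "\n\n"


def format_error(reason):
    """Build the failure reply: "error", the reason, then a blank line."""
    return "error\n" + reason + "\n\n"
```

A searcher daemon would read a request from stdin, run the query against
its index, and write the result of format_ok or format_error to stdout.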
>
> This is the state but I'd like to see some changes:
>
> I think we should move here from plaintext to JSON as well to gain a
> bit more flexibility. The basic idea is that this mechanism is good
> for any kind of indexing, not just fulltext. A friend of mine is
> already working on geo-searching with this interface[2]. (In this
> light, I propose dropping the "fulltext" or "ft" label from the
> source for clarity.)
>
> So we could handle calls like
> http://server:5984/database/_search?q="query"&some_custom_arg=value&other_arg=othervalue
> and pass it to the searcher API as:
> {"db":"database", "args":[{"q":"query"}, {"some_custom_arg":"value"},
> {"other_arg":"othervalue"}]}\n
> \n
> and expect back a JSON result as well, either in single chunks or as one
> huge object:
>
> Chunks:
> {"ok":"true"}\n (or {"error":"reason"}\n\n)
> {"id":"docid", "score":"score"}\n
> {"id":"docid", "score":"score"}\n
> {"id":"docid", "score":"score"}\n
> ...
> \n
>
> Huge:
> {"ok":"true", "result": [
> {"id":"docid", "score":"score"},
> {"id":"docid", "score":"score"},
> {"id":"docid", "score":"score"}
> ]}\n
> \n
>
> This would allow us to enable searchers to add custom values to the
> results and have CouchDB just add them transparently to the result set
> (like with the transparent handling of additional HTTP query arguments).
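
To illustrate the proposed JSON variant, the following Python sketch builds
the request from an HTTP query string and reads a reply in the "chunks"
form. Since the proposal is explicitly not final, this is only one possible
reading of it: the function names are invented here, the query string is
assumed to carry unquoted values, and the request is assumed to end with a
blank line as in the example above.

```python
import json
from urllib.parse import parse_qsl


def build_request(db, query_string):
    """Turn "q=query&other_arg=value" into the proposed request:
    a JSON object with "db" and a list of single-key "args" objects,
    followed by a blank line."""
    args = [{key: value} for key, value in parse_qsl(query_string)]
    return json.dumps({"db": db, "args": args}) + "\n\n"


def parse_chunked_response(text):
    """Parse a reply in the "chunks" form.

    The first line is a status object. Each following line is one hit,
    and a blank line ends the reply. Hits are returned as dictionaries,
    so any custom keys a searcher adds are passed through unchanged.
    """
    lines = text.split("\n")
    status = json.loads(lines[0])
    if "error" in status:
        raise RuntimeError(status["error"])
    hits = []
    for line in lines[1:]:
        if not line:
            break  # blank line terminates the reply
        hits.append(json.loads(line))
    return hits
```

Keeping each hit as a plain dictionary is what lets extra searcher-supplied
values flow through to the caller without CouchDB having to know about
them.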
>
> All of those changes are just to explain the direction I wish to see
> this go in, not well-thought-out proposals. I really appreciate your
> feedback and input here.
>
> I think we do have a halfway working indexer and searcher written for
> Java Lucene. I wrote some code for that a year ago and somebody (please
> step up!) improved that to work on the current CouchDB. But this
> certainly could use some work and any contributions here are very
> welcome (read: I don't want to do it).
>
> One more future direction that was discussed inconclusively before was
> the fulltext indexing of views. The general consensus was that we want
> to have it, but haven't figured out a good way to actually implement
> it. The mailing list archives have some valuable posts on that.
>
> So this is the current state. Now it's your turn :-)  How would
> JSearch fit into all this? I'm happy to help with any integration
> questions and suggestions for improvements on the CouchDB side, but
> I'd prefer not to have to deal with the Java side of things.
>
> Oh, and one more point Noah Slater brought up in IRC: Adding Java as a
> default requirement to CouchDB is quite heavy. And we need to discuss
> how this is supposed to be packaged and distributed with CouchDB.
>
> Cheers
> Jan
> --
>
> [1] I could swear there was a website but I can't find it anymore.
> So Jun and Bo, could you introduce JSearch to the others here?
>
> [2] http://vmx.cx/cgi-bin/blog/index.cgi