incubator-couchdb-dev mailing list archives

From Jan Lehnardt <...@apache.org>
Subject Re: The state of the fulltext search
Date Mon, 12 May 2008 11:22:54 GMT

On May 12, 2008, at 03:16, Jun Rao wrote:

> ***********************
> Warning: Your file, jsearch.tgz, contains more than 32 files after  
> decompression and cannot be scanned.
> ***********************

Heya Jun,
it looks like your attachment got stripped from the mail.
Can you upload it somewhere?

Cheers
Jan
--


> Jan,
>
> Thanks for the introduction.
>
> JSearch is a prototype that we developed for indexing and searching JSON
> documents, and we are enthusiastic about contributing it to CouchDB.
> JSearch converts a given JSON document to a Lucene document for indexing.
> The conversion is lossless and preserves all structural information in
> the original JSON document. We achieve that by storing the encoding of
> JSON structures in the payload of the posting list in a Lucene index.
> JSearch has a simple query language that combines fulltext search and
> structural querying. To qualify as a match, a document has to match both
> the JSON structures and the Boolean constraints specified in the query.
> Suppose that we have indexed the following two JSON documents:
>   d1={ A: [ { B: "b1",  C: "c1" },
>             { B: "b2",  C: "c2" }
>           ]
>      }
>   d2={ A: [ { B: "b1",  C: "c2" },
>             { B: "b2",  C: "c1" }
>           ]
>      }
> One can issue the following two JSearch queries:
>   P={ A: [ { B: "b1" && C: "c1" } ] }
>   Q={ A: [ { B: "b1" } && { C: "c1" } ] }
> Query P ("&&" specifies conjunction) matches d1, but not d2, because no
> single JSON object in d2 has both B: "b1" and C: "c1". Query Q, on the
> other hand, matches both d1 and d2, since it doesn't require the B field
> and the C field to be in the same JSON object.
>
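To make the difference between P and Q concrete, here is a minimal Python sketch of the two matching rules. It is not how JSearch works internally (JSearch encodes structure in Lucene payloads); the function names `matches_p` and `matches_q` are made up for illustration, and exact string equality stands in for JSearch's text search.

```python
# The two example documents, written as plain Python dicts.
d1 = {"A": [{"B": "b1", "C": "c1"}, {"B": "b2", "C": "c2"}]}
d2 = {"A": [{"B": "b1", "C": "c2"}, {"B": "b2", "C": "c1"}]}


def matches_p(doc):
    """P: a single object inside A must have both B == "b1" and C == "c1"."""
    return any(
        isinstance(obj, dict) and obj.get("B") == "b1" and obj.get("C") == "c1"
        for obj in doc.get("A", [])
    )


def matches_q(doc):
    """Q: some object in A has B == "b1" and some (possibly other) object has C == "c1"."""
    objects = [obj for obj in doc.get("A", []) if isinstance(obj, dict)]
    has_b = any(obj.get("B") == "b1" for obj in objects)
    has_c = any(obj.get("C") == "c1" for obj in objects)
    return has_b and has_c
```

With these definitions, `matches_p` accepts d1 only, while `matches_q` accepts both d1 and d2, which mirrors the behaviour described for P and Q.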
> Here is a summary of the querying features in JSearch:
> 1. arbitrary conjunctive and disjunctive constraints
> 2. text search on atomic values of string type
> 3. range constraints on atomic values (only those of string and long
>    types are currently supported)
> 4. document-level matching
>
> The easiest way to learn more about JSearch is to give it a try. Download
> the attached tgz file (I couldn't include the Lucene library because of
> the limit on message size, so you have to download it separately).
> Follow the readme file in it and try some of the examples. The attachment
> also includes all Java source code (I can provide more technical details
> if needed). I am very interested in your feedback. Does JSearch fit into
> CouchDB? What other features are needed? How should JSearch be integrated
> (from Jan's mail, it seems that some infrastructure is already in place)?
> Thanks,
>
> (See attached file: jsearch.tgz)
>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA  95120-6099
>
> junrao@almaden.ibm.com
> (408)927-1886 (phone)
> (408)927-3215 (fax)
>
>
> Jan Lehnardt <jan@apache.org> wrote on 05/10/2008 11:56:10 AM:
>
>> Heya folks,
>> this mail is an introduction for Jun and Bo from IBM who
>> would like to contribute JSearch[1] to CouchDB. JSearch
>> sits on top of Lucene so this clearly affects our fulltext
>> search. All cheers to Jun and Bo I say! :-)
>>
>> I'll summarise what the current state is and what is planned to
>> give a basis for discussion of how things could be integrated.
>>
>> Fulltext search separates indexing and searching.
>>
>> Indexing works like this: In couch.ini you specify a standalone
>> daemon with the DbUpdateNotificationProcess setting. This
>> daemon gets launched by CouchDB when it starts up. The
>> daemon is supposed to listen on stdin for notifications
>> from CouchDB.
>>
>> Each time a database in CouchDB is changed, CouchDB sends
>> a JSON object to the notification daemon's stdin:
>> {"type": "updated", "db":"database_name"}\n
>> CouchDB expects no answer. The indexer can then do whatever
>> it wants, for example polling CouchDB for the latest changes and
>> saving them into a fulltext index. The JSON structure might be
>> expanded in the future, but in a backwards-compatible
>> manner (after 1.0; before 1.0 we might break everything :-).
>>
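A rough Python sketch of what such a notification daemon's read loop could look like. Only the `{"type": "updated", "db": ...}` line format comes from the description above; the names `handle_notifications` and `reindex` are hypothetical, and the actual indexing work is left as a placeholder.

```python
import json
import sys


def handle_notifications(stream, on_update):
    """Read one JSON object per line and call on_update(db) for "updated" events."""
    for line in stream:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except ValueError:
            continue  # ignore lines that are not valid JSON
        if isinstance(event, dict) and event.get("type") == "updated" and "db" in event:
            on_update(event["db"])


def reindex(db_name):
    # Hypothetical placeholder: a real indexer would poll CouchDB for the
    # latest changes in db_name and feed them into its index here.
    print("would re-index", db_name, file=sys.stderr)


# A daemon launched by CouchDB would run:
#     handle_notifications(sys.stdin, reindex)
```

Unknown event types and extra keys are ignored on purpose, so the loop keeps working if the JSON structure is extended later in a backwards-compatible way.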
>> On this end, I think it would be nice to have a set of scripts that
>> make it easy to register for events in all major languages so that
>> people don't have to reimplement the listening and polling parts
>> and can concentrate on what they actually want to accomplish, but no
>> design or work has gone in this direction yet.
>>
>>
>> Searching works very similarly in that a daemon listens on stdin
>> for commands from CouchDB. The protocol is a little more complex
>> here because it requires two-way communication.
>> CouchDB exposes the search part over the HTTP API. At the
>> moment you can call
>> http://server:5984/database/_search?q="searchstring"
>> and CouchDB will send this to the searcher daemon:
>> database\n
>> searchstring\n
>> \n
>> The searcher is expected to answer either with:
>> error\n
>> reason\n
>> \n
>>
>> or
>>
>> ok\n
>> docid\n
>> score\n
>> docid\n
>> score\n
>> .
>> .
>> .
>> \n
>>
>> And CouchDB takes this list and returns it wrapped in JSON back to the
>> caller.
>>
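For illustration, the searcher side of this plaintext protocol could be sketched in Python roughly as follows. The helper names (`read_request`, `format_ok`, `format_error`) are invented here; only the line layout of the request and of the two kinds of answer is taken from the description above.

```python
def read_request(stream):
    """Read one request: a database line, a query line, then an empty line.

    Returns (database, query), or None when the input is exhausted.
    """
    db = stream.readline()
    if not db:
        return None  # end of input, CouchDB closed the pipe
    query = stream.readline()
    stream.readline()  # consume the terminating empty line
    return db.rstrip("\n"), query.rstrip("\n")


def format_ok(hits):
    """Format a successful answer; hits is an iterable of (docid, score) pairs."""
    lines = ["ok"]
    for docid, score in hits:
        lines.append(str(docid))
        lines.append(str(score))
    # The final empty line marks the end of the answer.
    return "\n".join(lines) + "\n\n"


def format_error(reason):
    """Format an error answer with a single-line reason."""
    return "error\n" + str(reason) + "\n\n"
```

A searcher daemon would call `read_request(sys.stdin)` in a loop, run the query against its index, and write the formatted answer to stdout, flushing after each one.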
>> This is the current state, but I'd like to see some changes:
>>
>> I think we should move from plaintext to JSON here as well to gain a bit
>> more flexibility. The basic idea is that this mechanism is good for any
>> kind of indexing, not just fulltext. A friend of mine is already working
>> on geo-searching with this interface[2]. (In this light, I propose
>> dropping the "fulltext" or "ft" label from the source for clarity.)
>>
>> So we could handle calls like
>> http://server:5984/database/_search?q="query"&some_custom_arg=value&other_arg=othervalue
>> and pass them to the searcher API as:
>> {"db":"database", "args":[{"q":"query"}, {"some_custom_arg":"value"},
>> {"other_arg":"othervalue"}]}\n
>> \n
>> and expect back a JSON result as well, either in single chunks or as one
>> huge object:
>>
>> Chunks:
>> {"ok":"true"}\n (or {"error":"reason"}\n\n)
>> {"id":"docid", "score":"score"}\n
>> {"id":"docid", "score":"score"}\n
>> {"id":"docid", "score":"score"}\n
>> ...
>> \n
>>
>> Huge:
>> {"ok":"true", "result": [
>> {"id":"docid", "score":"score"},
>> {"id":"docid", "score":"score"},
>> {"id":"docid", "score":"score"}
>> ]}\n
>> \n
>>
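Since this JSON variant is only a proposal, the following Python sketch is equally tentative. It shows how the CouchDB side might build the request and read the "chunks" style answer; the function names are hypothetical and the exact wire format is assumed from the examples above.

```python
import json


def build_request(db, params):
    """Build the request for the searcher.

    params is a list of (name, value) pairs taken from the HTTP query string.
    """
    args = [{name: value} for name, value in params]
    # One JSON line followed by an empty line, as in the example above.
    return json.dumps({"db": db, "args": args}) + "\n\n"


def parse_chunked_response(stream):
    """Parse the 'chunks' variant: a status line, one hit per line, an empty line."""
    status = json.loads(stream.readline())
    if "error" in status:
        raise RuntimeError(status["error"])
    hits = []
    for line in stream:
        line = line.strip()
        if not line:
            break  # the empty line terminates the answer
        hits.append(json.loads(line))
    return hits
```

Each hit is kept as a full dict rather than reduced to id and score, so any custom values a searcher adds would be carried through unchanged.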
>> This would enable searchers to add custom values to the results and
>> have CouchDB just add them transparently to the result set (like the
>> transparent handling of additional HTTP query arguments).
>>
>> All of those changes are just to explain the direction I wish to see
>> this go in; they are not very well thought-out proposals. I really
>> appreciate your feedback and input here.
>>
>> I think we do have a halfway working indexer and searcher written for
>> Java Lucene. I wrote some code for that a year ago and somebody (please
>> step up!) improved that to work on the current CouchDB. But this
>> certainly could use some work and any contributions here are very
>> welcome (read: I don't want to do it).
>>
>> One more future direction that was discussed inconclusively before was
>> the fulltext indexing of views. The general consensus was that we want
>> to have it, but haven't figured out a good way to actually implement
>> it. The mailing list archives have some valuable posts on that.
>>
>> So this is the current state. Now it's your turn :-)  How would JSearch
>> fit into all this? I'm happy to help with any integration questions and
>> suggestions for improvements on the CouchDB side, but I'd prefer not to
>> have to deal with the Java side of things.
>>
>> Oh, and one more point Noah Slater brought up in IRC: Adding Java as a
>> default requirement to CouchDB is quite heavy. And we need to discuss
>> how this is supposed to be packaged and distributed with CouchDB.
>>
>> Cheers
>> Jan
>> --
>>
>> [1] I could swear there was a website but I can't find it anymore.
>> So Jun and Bo, could you introduce JSearch to the others here?
>>
>> [2] http://vmx.cx/cgi-bin/blog/index.cgi

