incubator-couchdb-dev mailing list archives

From Jun Rao <jun...@almaden.ibm.com>
Subject Re: The state of the fulltext search
Date Mon, 12 May 2008 16:24:46 GMT
I opened a new JIRA and uploaded the complete JSearch package. See
https://issues.apache.org/jira/browse/COUCHDB-53

Thanks,

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA  95120-6099

junrao@almaden.ibm.com
(408)927-1886 (phone)
(408)927-3215 (fax)


Jan Lehnardt <jan@apache.org> wrote on 05/12/2008 04:22:54 AM:

>
> On May 12, 2008, at 03:16, Jun Rao wrote:
>
> > ***********************
> > Warning: Your file, jsearch.tgz, contains more than 32 files after
> > decompression and cannot be scanned.
> > ***********************
>
> Heya Jun,
> it looks like your attachment got stripped from the mail.
> Can you upload it somewhere?
>
> Cheers
> Jan
> --
>
>
> > Jan,
> >
> > Thanks for the introduction.
> >
> > JSearch is a prototype that we developed for indexing and searching JSON
> > documents, and we are enthusiastic about contributing it to CouchDB.
> > JSearch converts a given JSON document to a Lucene document for indexing.
> > The conversion is lossless and preserves all structural information in the
> > original JSON document. We achieve that by storing the encoding of JSON
> > structures in the payload of the posting list in a Lucene index. JSearch
> > has a simple query language that combines fulltext search and structural
> > querying. To qualify as a match, a document has to match both the JSON
> > structures and the Boolean constraints specified in the query. Suppose
> > that we have indexed the following two JSON documents:
> >   d1={ A: [ { B: “b1”,  C: “c1” },
> >             { B: “b2”,  C: “c2” },
> >           ]
> >      }
> >   d2={ A: [ { B: “b1”,  C: “c2” },
> >             { B: “b2”,  C: “c1” },
> >           ]
> >      }
> > One can issue the following two JSearch queries:
> >   P={ A: [ { B: “b1” && C: “c1” } ] }
> >   Q={ A: [ { B: “b1”} && {C: “c1” } ] }
> > Query P (“&&” specifies conjunction) matches d1, but not d2. The reason is
> > that d2 doesn’t have the proper B and C fields within the same JSON object.
> > On the other hand, query Q matches both d1 and d2, since it doesn’t require
> > the B field and the C field to be in the same JSON object.
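
A minimal Python sketch of the difference between P and Q, assuming the two documents are held as plain dicts. The helper names matches_p and matches_q are made up for illustration; this is not JSearch code and says nothing about how JSearch evaluates queries against the Lucene index:

```python
# d1 and d2 from the mail, written as Python dicts.
d1 = {"A": [{"B": "b1", "C": "c1"}, {"B": "b2", "C": "c2"}]}
d2 = {"A": [{"B": "b1", "C": "c2"}, {"B": "b2", "C": "c1"}]}


def matches_p(doc):
    """P: a single object inside A must have both B == "b1" and C == "c1"."""
    return any(
        obj.get("B") == "b1" and obj.get("C") == "c1"
        for obj in doc.get("A", [])
    )


def matches_q(doc):
    """Q: some object in A has B == "b1" and some (possibly other) object has C == "c1"."""
    objs = doc.get("A", [])
    has_b = any(obj.get("B") == "b1" for obj in objs)
    has_c = any(obj.get("C") == "c1" for obj in objs)
    return has_b and has_c


print(matches_p(d1), matches_p(d2))  # True False
print(matches_q(d1), matches_q(d2))  # True True
```
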
> >
> > Here is a summary of the querying features in JSearch
> > 1. arbitrary conjunctive and disjunctive constraints
> > 2. text search on atomic values of string type
> > 3. range constraints on atomic values (only those of string and long types
> >    are currently supported)
> > 4. document level matching
> >
> > The easiest way to learn more about JSearch is to give it a try. Download
> > the attached tgz file (I couldn't include the Lucene library because of the
> > limit on message size, so you have to download it separately yourself).
> > Follow the readme file in it and try some of the examples. The attachment
> > also includes all Java source code (I can provide more technical details if
> > needed). I am very interested in your feedback. Does JSearch fit into
> > CouchDB? What other features are needed? How should JSearch be integrated
> > (from Jan's mail, it seems that some infrastructure is already in place)?
> > Thanks,
> >
> > (See attached file: jsearch.tgz)
> >
> > Jun
> > IBM Almaden Research Center
> > K55/B1, 650 Harry Road, San Jose, CA  95120-6099
> >
> > junrao@almaden.ibm.com
> > (408)927-1886 (phone)
> > (408)927-3215 (fax)
> >
> >
> > Jan Lehnardt <jan@apache.org> wrote on 05/10/2008 11:56:10 AM:
> >
> >> Heya folks,
> >> this mail is an introduction for Jun and Bo from IBM who
> >> would like to contribute JSearch[1] to CouchDB. JSearch
> >> sits on top of Lucene so this clearly affects our fulltext
> >> search. All cheers to Jun and Bo I say! :-)
> >>
> >> I'll summarise what the current state is and what is planned to
> >> give a basis for discussion of how things could be integrated.
> >>
> >> Fulltext search separates indexing and searching.
> >>
> >> Indexing works like this: In couch.ini you specify a standalone
> >> daemon with the DbUpdateNotificationProcess setting. This
> >> daemon gets launched by CouchDB when it starts up. The
> >> daemon is supposed to listen on stdin for notifications
> >> from CouchDB.
> >>
> >> Each time a database in CouchDB is changed, CouchDB sends
> >> a JSON object over stdio to the notification daemon:
> >> {"type": "updated", "db":"database_name"}\n
> >> CouchDB expects no answer. The indexer can then do whatever
> >> it wants, for example polling CouchDB for the latest changes and
> >> saving them into a fulltext index. The JSON structure might be
> >> expanded in the future, but in a backwards compatible
> >> manner (after 1.0; before 1.0 we might break everything :-).
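
The listening side of such a notification daemon could look roughly like the following Python sketch. It assumes one JSON object per line as described above; handle_notifications and the on_update callback are hypothetical names, and an in-memory stream stands in for stdin so the example runs without CouchDB:

```python
import io
import json


def handle_notifications(stream, on_update):
    """Call on_update(db_name) for every update notification read from stream."""
    for line in stream:
        line = line.strip()
        if not line:
            continue
        try:
            msg = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON
        # Unknown keys are ignored, so later additions to the object do no harm.
        if isinstance(msg, dict) and msg.get("type") == "updated" and "db" in msg:
            on_update(msg["db"])


# Stand-in for sys.stdin; a real daemon would pass sys.stdin instead.
fake_stdin = io.StringIO('{"type": "updated", "db": "database_name"}\n')
updated = []
handle_notifications(fake_stdin, updated.append)
print(updated)  # ['database_name']
```

In a real indexer the callback would poll CouchDB for the latest changes of that database; that part is left out here.
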
> >>
> >> On this end, I think it would be nice to have a set of scripts that
> >> make it easy to register for events in all major languages so that
> >> people don't have to reimplement the listening and polling parts
> >> and concentrate on what they actually want to accomplish, but no
> >> design or work went into this direction.
> >>
> >>
> >> Searching works very similarly in that a daemon listens on stdin
> >> for commands from CouchDB. The protocol is a little more complex
> >> here because it requires two-way communication.
> >> CouchDB exposes the search part over the HTTP API. At the moment you can
> >> call http://server:5984/database/_search?q="searchstring"
> >> and CouchDB will send this to the searcher daemon:
> >> database\n
> >> searchstring\n
> >> \n
> >> The searcher is expected to answer either with:
> >> error\n
> >> reason\n
> >> \n
> >>
> >> or
> >>
> >> ok\n
> >> docid\n
> >> score\n
> >> docid\n
> >> score\n
> >> .
> >> .
> >> .
> >> \n
> >>
> >> And CouchDB takes this list and returns it wrapped in JSON back to the
> >> caller.
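
The plaintext exchange above could be handled on the searcher side roughly as in this Python sketch. The function names parse_request, format_ok and format_error are hypothetical, and the actual index lookup is omitted; a real searcher would read the request from stdin and write the reply to stdout:

```python
def parse_request(text):
    """Split a 'database\\nsearchstring\\n\\n' request into (database, query)."""
    lines = text.split("\n")
    if len(lines) < 2:
        raise ValueError("request needs a database line and a query line")
    return lines[0], lines[1]


def format_ok(hits):
    """Render (docid, score) pairs as the 'ok' reply, ending with an empty line."""
    out = ["ok"]
    for docid, score in hits:
        out.append(str(docid))
        out.append(str(score))
    return "\n".join(out) + "\n\n"


def format_error(reason):
    """Render the 'error' reply, ending with an empty line."""
    return "error\n" + reason + "\n\n"


print(parse_request("database\nsearchstring\n\n"))  # ('database', 'searchstring')
print(repr(format_ok([("doc1", 1.5), ("doc2", 0.75)])))
print(repr(format_error("no such database")))
```
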
> >>
> >> This is the state but I'd like to see some changes:
> >>
> >> I think we should move here from plaintext to JSON as well to gain a bit
> >> more flexibility. The basic idea is that this mechanism is good for any
> >> kind of indexing, not just fulltext. A friend of mine is already working
> >> on geo-searching with this interface[2]. (In this light, I propose
> >> dropping the "fulltext" or "ft" label from the source for clarification.)
> >>
> >> So we could handle calls like
> >> http://server:5984/database/_search?q="query"&some_custom_arg=value&other_arg=othervalue
> >> and pass it to the searcher API as:
> >> {"db":"database", "args":[{"q":"query"}, {"some_custom_arg":"value"},
> >> {"other_arg":"othervalue"}]}\n
> >> \n
> >> and expect back a JSON result as well: either in single chunks or one
> >> huge object:
> >>
> >> Chunks:
> >> {"ok":"true"}\n (or {"error":"reason"}\n\n)
> >> {"id":"docid", "score":"score"}\n
> >> {"id":"docid", "score":"score"}\n
> >> {"id":"docid", "score":"score"}\n
> >> ...
> >> \n
> >>
> >> Huge:
> >> {"ok":"true", "result": [
> >> {"id":"docid", "score":"score"},
> >> {"id":"docid", "score":"score"},
> >> {"id":"docid", "score":"score"}
> >> ]}\n
> >> \n
> >>
> >> This would allow us to enable searchers to add custom values to the
> >> results and have CouchDB just add them transparently to the result set
> >> (like with the transparent handling of additional HTTP query arguments).
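
To make the proposed JSON variant concrete, here is a small Python sketch of the request and of the chunked reply. It only illustrates the shape suggested above, which is a proposal and not an existing CouchDB interface; build_request and chunked_response are hypothetical names, and the "distance" key is an invented example of a custom value added by a searcher:

```python
import json


def build_request(db, params):
    """Wrap query-string (name, value) pairs in the proposed request object."""
    request = {"db": db, "args": [{name: value} for name, value in params]}
    return json.dumps(request) + "\n\n"


def chunked_response(hits):
    """One JSON object per line: a status line, then one line per hit, then an empty line."""
    lines = [json.dumps({"ok": "true"})]
    lines.extend(json.dumps(hit) for hit in hits)
    return "\n".join(lines) + "\n\n"


request = build_request("database", [("q", "query"), ("some_custom_arg", "value")])
print(request, end="")

# Extra keys pass through untouched, so CouchDB could forward them as-is.
reply = chunked_response([{"id": "docid", "score": "score", "distance": "1.2km"}])
print(reply, end="")
```
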
> >>
> >> All of those changes are just to explain the direction I wish to see this
> >> go in, not very well thought-out proposals. I really appreciate your
> >> feedback and input here.
> >>
> >> I think we do have a halfway working indexer and searcher written for Java
> >> Lucene. I wrote some code for that a year ago and somebody (please step
> >> up!) improved that to work on the current CouchDB. But this certainly
> >> could use some work and any contributions here are very welcome (read: I
> >> don't want to do it).
> >>
> >> One more future direction that was discussed inconclusively before was the
> >> fulltext indexing of views. The general consensus was that we want to have
> >> it, but haven't figured out a good way to actually implement it. The
> >> mailing list archives have some valuable posts on that.
> >>
> >> So this is the current state. Now it's your turn :-)  How would JSearch
> >> fit into all this? I'm happy to help with any integration questions and
> >> suggestions for improvements on the CouchDB side, but I'd prefer not to
> >> have to deal with the Java side of things.
> >>
> >> Oh, and one more point Noah Slater brought up in IRC: Adding Java as a
> >> default requirement to CouchDB is quite heavy. And we need to discuss
> >> how this is supposed to be packaged and distributed with CouchDB.
> >>
> >> Cheers
> >> Jan
> >> --
> >>
> >> [1] I could swear there was a website but I can't find it anymore.
> >> So Jun and Bo, could you introduce JSearch to the others here?
> >>
> >> [2] http://vmx.cx/cgi-bin/blog/index.cgi
>