Message-ID: <4D921BDC.1090700@cloudant.com>
Date: Tue, 29 Mar 2011 10:50:20 -0700
From: David Hardtke <david@cloudant.com>
To: user@couchdb.apache.org
Subject: Re: Full text search - is it coming? If yes, approx when.

Hi All,

If we're discussing a native CouchDB full text search, I'd like to point 
out a few things about Cloudant's implementation that might guide the 
design.  Our search indexing strategy is discussed here:

http://support.cloudant.com/kb/search/search-indexing

What we've opted to do is borrow certain parts of Lucene (particularly 
the analyzer, query parser, and searching/scoring) but completely 
rewrite the inverted index storage format.  We store the inverted index 
as a CouchDB map-reduce view.  This is of course inefficient in terms of 
disk space, but great in terms of taking advantage of CouchDB's 
robustness and job scheduling capabilities.

The map format for the inverted index is the following:

{"id":doc_id,"key":[lucene_field,term],"value":[[array with list of term 
positions in document]]}

In the web link above, you can see the inverted index for a single 
document with id "example_glossary".
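
To make the row format concrete, here is a minimal Python sketch of a map step that emits rows in that shape.  It is only an illustration, not our Java view server: the analyze() function is a hypothetical stand-in for a real Lucene analyzer, and wrapping the positions list in an outer list is my reading of the double brackets in the format above.

```python
import re
from collections import defaultdict


def analyze(text):
    """Hypothetical stand-in for a Lucene analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[a-z0-9]+", text.lower())


def map_doc(doc):
    """Return one row per (field, term), carrying the term's positions in that field."""
    rows = []
    for field, value in sorted(doc.items()):
        # Skip the document id and any non-text fields in this sketch.
        if field == "_id" or not isinstance(value, str):
            continue
        positions = defaultdict(list)
        for pos, term in enumerate(analyze(value)):
            positions[term].append(pos)
        for term in sorted(positions):
            rows.append({
                "id": doc["_id"],
                "key": [field, term],
                # Assumed layout: a single positions list wrapped in an outer list.
                "value": [positions[term]],
            })
    return rows


if __name__ == "__main__":
    for row in map_doc({"_id": "example_glossary", "title": "to be or not to be"}):
        print(row)
```

Because the key is [field, term], all rows for one term in one field sort together in the view, which is what makes lookups by term possible.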

Search in Lucene is scored using the tf-idf model.  tf = term frequency 
= number of times a token appears in a particular document.  idf = 
inverse document frequency, which is derived from the number of 
documents in the collection that contain the token (the rarer the 
token, the higher its idf).  The map view gives us the information we 
need for tf, and to get the document counts needed for idf we use the 
Erlang builtin _count reduce function.
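
As a rough sketch of how these two quantities fall out of the view (assuming rows shaped like the map format above, with the positions list as the first element of the value), tf is the length of the positions list and the document count for idf is what a _count reduce grouped by key would return.  The idf formula below is the plain textbook log(N / df); I am not claiming it is the exact formula Lucene or our searcher uses, and the function names are made up for illustration.

```python
import math


def term_frequency(row):
    """tf: how many positions the term occupies in this document's field."""
    return len(row["value"][0])


def document_frequency(rows, key):
    """Mimics a _count reduce grouped by key: one row per document containing the term."""
    return sum(1 for row in rows if row["key"] == key)


def tf_idf(row, rows, total_docs):
    """Textbook tf * log(N / df) weighting (an assumption, not Lucene's exact scoring)."""
    df = document_frequency(rows, row["key"])
    return term_frequency(row) * math.log(total_docs / df)


if __name__ == "__main__":
    rows = [
        {"id": "a", "key": ["title", "couchdb"], "value": [[0, 3]]},
        {"id": "b", "key": ["title", "couchdb"], "value": [[1]]},
        {"id": "a", "key": ["title", "search"], "value": [[2]]},
    ]
    # total_docs is the number of documents in the collection (assumed known here).
    print(tf_idf(rows[0], rows, total_docs=4))
```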

We have separate processing for indexing and search -- the indexing is 
just a regular view server (we use our Java view server by default, but 
there is nothing to prevent anyone from using a JavaScript or Erlang map 
function to create the inverted index).

The "searcher application" knows how to utilize these CouchDB views to 
do search query logic (boolean queries, phrase queries, range queries, 
etc.).  The searcher application knows nothing about who created the 
map-reduce views that it utilizes -- only that the format is as 
specified above (the one caveat is that the searcher application needs 
to know the analyzer to work properly).
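
To show why the stored positions are enough for this kind of query logic, here is a small Python sketch of a boolean AND and a phrase query over rows in the format above.  It works on an in-memory list of rows rather than real view requests, the function names are hypothetical, and it assumes the query terms have already been run through the same analyzer that built the index.

```python
def postings(rows, field, term):
    """doc id -> positions for one [field, term] key, as a key lookup on the view would give."""
    return {row["id"]: row["value"][0] for row in rows if row["key"] == [field, term]}


def boolean_and(rows, field, terms):
    """Ids of documents whose field contains every term."""
    docs = None
    for term in terms:
        ids = set(postings(rows, field, term))
        docs = ids if docs is None else docs & ids
    return sorted(docs or [])


def phrase_match(rows, field, terms):
    """Ids of documents where the terms occur at consecutive positions, in order."""
    per_term = [postings(rows, field, term) for term in terms]
    hits = []
    for doc_id in boolean_and(rows, field, terms):
        starts = per_term[0][doc_id]
        # A phrase starts at s if term i sits at position s + i for every i.
        if any(
            all(s + i in per_term[i][doc_id] for i in range(1, len(terms)))
            for s in starts
        ):
            hits.append(doc_id)
    return hits


if __name__ == "__main__":
    rows = [
        {"id": "a", "key": ["body", "full"], "value": [[0]]},
        {"id": "a", "key": ["body", "text"], "value": [[1]]},
        {"id": "b", "key": ["body", "full"], "value": [[2]]},
        {"id": "b", "key": ["body", "text"], "value": [[0]]},
    ]
    print(boolean_and(rows, "body", ["full", "text"]))
    print(phrase_match(rows, "body", ["full", "text"]))
```

Both documents contain both terms, but only in "a" do they appear next to each other in order, so the phrase query is stricter than the boolean one.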

Just as with CLucene vs. Lucene, we could standardize the format of the 
CouchDB inverted indices for any "native" search applications.  Writing 
a simple Erlang-based "searcher application" shouldn't be very difficult 
(I plan to do it at some point) -- one option would be to extend 
Norman's multiview.  This Erlang "searcher application" would work on 
mobile devices.

I propose that we have a standard inverted index format, that this 
inverted index is stored in a CouchDB view, and that all indexing and 
search applications recognize this standard format.  Keep in mind that 
for many applications external services such as couchdb-lucene and 
ElasticSearch will be superior to a native CouchDB search solution -- 
these are optimized for efficiency of inverted index lookups.

Dave


On 03/29/11 08:57, Albin Stigo wrote:
> Couchdb + Lucene (Elasticsearch etc.) is a really great combination
> and definitely enough on the server side... IMHO what is missing is a
> full text engine for Couchdb on mobile - that would be a killer...
> Currently the only full text search library on mobile devices is
> sqlite fts3 which is great but doesn't have replication. Maybe someone
> could implement something based on sqlite fts3 which uses the changes
> stream to keep in sync...?
>
> How do you search couchdb on mobile devices?
>
> --Albin
>
> On Tue, Mar 29, 2011 at 4:09 PM, Norman Barker<norman.barker@gmail.com>  wrote:
>> Benoit,
>>
>> interesting post on Lucy, I have been monitoring that as well (and
>> though no where near as good as Robert's work) I have integrated
>> clucene and couchdb as I was looking for a solution that didn't use
>> Java.
>>
>> I see a trend with couchdb and NIFs, what is the official standpoint
>> here, test and test the c / c++ library so that any chance of bringing
>> the VM down is reduced? I know with Java and JNI in an app server you
>> are taking a huge risk (heartbeat works, but an app server takes
>> several minutes to start up), with Erlang are you relying on the
>> heartbeat service to restart the VM in case of failure?
>>
>> I am interested in helping with any NIF on top of Lucy.
>>
>> thanks,
>>
>> Norman
>>
>> On Tue, Mar 29, 2011 at 7:58 AM, Simon Metson
>> <simonmetson@googlemail.com>  wrote:
>>> Does http://blog.cloudant.com/developer-preview-cloudant-search-for-couchdb/ help wrt. the original post? Cloudant's search is built on Lucene.
>>> Cheers
>>> Simon
>>>
>>> Sent with Sparrow
>>> On Tuesday, 29 March 2011 at 14:24, Dennis Geurts wrote:
>>> Hi all,
>>>> Looking at the amount of replies wrt to this topic it seems there's much interest in full text searching.
>>>>
>>>> It's really hard to tell how one would expect this feature to be implemented in couchdb in such a way that it would supersede the nice couchdb-lucene combo.
>>>>
>>>> That said, if you want a _really simple_ (and probably bad solution performance wise!) fulltext search implementation, have a look at couchdb lists.
>>>>
>>>> You decide which _view is sent to the _list function; within the _list function you can implement your full text search by inspecting the document data in javascript.
>>>>
>>>> This setup at least allows for replication of the fts functionality and might be just enough for the OP.
>>>>
>>>>
>>>>
>>>> Cheers, dennis
>>>>
>>>>
>>>>
>>>> ----- Reply message -----
>>>> From: "Zdravko Gligic"<zgligic@gmail.com>
>>>> Date: Tue, Mar 29, 2011 13:49
>>>> Subject: Full text search - is it coming? If yes, approx when.
>>>> To: "user@couchdb.apache.org"<user@couchdb.apache.org>
>>>>
>>>> I have a bit tricky use case of super tagging or rather a somewhat
>>>> hierarchical docs categorization. Several CouchDB gurus have suggested
>>>> that I should look at Lucene and such. My problem is hosting because
>>>> I would most rather go with a cloud solution such as Cloudant and
>>>> forthcoming (I hope it's still forthcoming) CouchBase. Comparatively,
>>>> I have very little amount of data - large number of tiny docs that are
>>>> indexed every which way possible - such that the size of views dwarfs
>>>> the size of docs.
>>>>
>>>> The full-text-searching problem is best illustrated by the
>>>> full-text-searching hosting state of affairs at Cloudant and CouchBase
>>>> - the only two commercial companies worth mentioning within the
>>>> CouchDb community. Neither one uses Lucene out of the box and only
>>>> Cloudant has their own solution. This means that I could not use a
>>>> redundancy-performance perfect Master-Master replication that is
>>>> hosted by both. This is why either full-text-searching needs to
>>>> become CouchDb's internal first citizen or our hosting friends need to
>>>> internalize and make Lucene their first class citizen.
>>>>
>>>> P.S. I love both but ...
>>>>