From: David Hardtke
To: user@couchdb.apache.org
Date: Tue, 29 Mar 2011 10:50:20 -0700
Message-ID: <4D921BDC.1090700@cloudant.com>
Subject: Re: Full text search - is it coming? If yes, approx when.

Hi All,

If we're discussing a native CouchDB full text search, I'd like to
point out a few things about Cloudant's implementation that might guide
the design. Our search indexing strategy is discussed here:

http://support.cloudant.com/kb/search/search-indexing

What we've opted to do is borrow certain parts of Lucene (particularly
the analyzer, query parser, and searching/scoring) but completely
rewrite the inverted index storage format. We store the inverted index
as a CouchDB map-reduce view. This is of course inefficient in terms of
disk space, but great in terms of taking advantage of CouchDB's
robustness and job scheduling capabilities. The map format for the
inverted index is the following:

{"id": doc_id, "key": [lucene_field, term], "value": [[array with list of term positions in document]]}

In the web link above, you can see the inverted index for a single
document with id "example_glossary".

Search in Lucene is scored using the tf-idf model.

tf = term frequency = number of times a token appears in a particular
document.
idf = inverse document frequency = a weight derived from how many
documents in the collection contain the token (the rarer the token, the
higher the weight).

The map view gives us the information we need for tf, and to get idf we
use the Erlang builtin _count reduce function.

We have separate processing for indexing and search -- the indexing is
just a regular view server (we use our Java view server by default, but
there is nothing to prevent anyone from using a JavaScript or Erlang
map function to create the inverted index). The "searcher application"
knows how to utilize these CouchDB views to do search query logic
(boolean queries, phrase queries, range queries, etc.).
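
To make the row format and the tf/idf bookkeeping described above a bit more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration only: the toy analyzer (a stand-in for a Lucene analyzer), the function names, the flat list of positions as the row value, and the textbook tf * log(N / df) weighting. It is not Cloudant's code and it is not Lucene's actual scoring formula.

```python
import math
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer (assumption): lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def map_doc(doc_id, doc):
    """Emit one inverted-index row per (field, term) with the term's positions."""
    rows = []
    for field, text in doc.items():
        positions = defaultdict(list)
        for pos, term in enumerate(analyze(text)):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            rows.append({"id": doc_id, "key": [field, term], "value": pos_list})
    return rows


def count_by_key(rows):
    """Mimic a grouped count reduce: number of rows (documents) per key."""
    counts = defaultdict(int)
    for row in rows:
        counts[tuple(row["key"])] += 1
    return dict(counts)


def tf_idf(rows, field, term, num_docs):
    """Textbook tf * log(N / df) per document (illustrative weighting only)."""
    key = (field, term)
    df = count_by_key(rows).get(key, 0)
    if df == 0:
        return {}
    idf = math.log(num_docs / df)
    # tf is the number of positions stored in the row for this document.
    return {r["id"]: len(r["value"]) * idf for r in rows if tuple(r["key"]) == key}


# Two hypothetical documents, each with a single "body" field.
docs = {
    "a": {"body": "CouchDB view of a view"},
    "b": {"body": "Lucene analyzer"},
}
rows = [row for doc_id, doc in docs.items() for row in map_doc(doc_id, doc)]
scores = tf_idf(rows, "body", "view", num_docs=len(docs))
```

Because each document emits exactly one row per (field, term) key, counting rows per key gives the number of documents containing the term, which is what the idf side needs; the length of the position list in a row gives tf for that document.
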

The searcher application knows nothing about who created the map-reduce
views that it utilizes -- only that the format is as specified above
(the one caveat is that the searcher application needs to know the
analyzer to work properly). Just as with clucene vs. lucene, we could
standardize the format of the CouchDB inverted indices for any "native"
search applications. Writing a simple Erlang-based "searcher
application" shouldn't be very difficult (I plan to do it at some
point) -- one option would be to extend Norman's multiview. This Erlang
"searcher application" would work on mobile devices.

I propose that we have a standard inverted index format, that this
inverted index is stored in a CouchDB view, and that all indexing and
search applications recognize this standard format.

Keep in mind that for many applications external services such as
couchdb-lucene and ElasticSearch will be superior to a native CouchDB
search solution -- these are optimized for efficiency of inverted index
lookups.

Dave

On 03/29/11 08:57, Albin Stigo wrote:
> Couchdb + Lucene (Elasticsearch etc.) is a really great combination
> and definitely enough on the server side... IMHO what is missing is a
> full text engine for Couchdb on mobile - that would be a killer...
> Currently the only full text search library on mobile devices is
> sqlite fts3 which is great but doesn't have replication. Maybe someone
> could implement something based on sqlite fts3 which uses the changes
> stream to keep in sync...?
>
> How do you search couchdb on mobile devices?
>
> --Albin
>
> On Tue, Mar 29, 2011 at 4:09 PM, Norman Barker wrote:
>> Benoit,
>>
>> interesting post on Lucy, I have been monitoring that as well (and
>> though no where near as good as Robert's work) I have integrated
>> clucene and couchdb as I was looking for a solution that didn't use
>> Java.
>>
>> I see a trend with couchdb and NIFs, what is the official standpoint
>> here, test and test the c / c++ library so that any chance of bringing
>> the VM down is reduced? I know with Java and JNI in an app server you
>> are taking a huge risk (heartbeat works, but an app server takes
>> several minutes to start up), with Erlang are you relying on the
>> heartbeat service to restart the VM in case of failure?
>>
>> I am interested in helping with any NIF on top of Lucy.
>>
>> thanks,
>>
>> Norman
>>
>> On Tue, Mar 29, 2011 at 7:58 AM, Simon Metson
>> wrote:
>>> Does http://blog.cloudant.com/developer-preview-cloudant-search-for-couchdb/ help wrt. the original post? Cloudant's search is built on Lucene.
>>> Cheers
>>> Simon
>>>
>>> Sent with Sparrow
>>> On Tuesday, 29 March 2011 at 14:24, Dennis Geurts wrote:
>>> Hi all,
>>>> Looking at the amount of replies wrt to this topic it seems there's much interest in full text searching.
>>>>
>>>> It's really hard to tell how one would expect this feature to be implemented in couchdb in such a way that it would supersede the nice couchdb-lucene combo.
>>>>
>>>> That said, if you want a _really simple_ (and probably bad solution performance wise!) fulltext search implementation, have a look at couchdb lists.
>>>>
>>>> You decide which _view is sent to the _list function; within the _list function you can implement your full text search by inspecting the document data in javascript.
>>>>
>>>> This setup at least allows for replication of the fts functionality and might be just enough for the OP.
>>>>
>>>>
>>>>
>>>> Cheers, dennis
>>>>
>>>>
>>>>
>>>> ----- Reply message -----
>>>> From: "Zdravko Gligic"
>>>> Date: Tue, Mar 29, 2011 13:49
>>>> Subject: Full text search - is it coming? If yes, approx when.
>>>> To: "user@couchdb.apache.org"
>>>>
>>>> I have a bit tricky use case of super tagging or rather a somewhat
>>>> hierarchical docs categorization.
>>>> Several CouchDB gurus have suggested
>>>> that I should look at Lucene and such. My problem is hosting because
>>>> I would most rather go with a cloud solution such as Cloudant and
>>>> forthcoming (I hope it's still forthcoming) CouchBase. Comparatively,
>>>> I have very little amount of data - large number of tiny docs that are
>>>> indexed every which way possible - such that the size of views dwarfs
>>>> the size of docs.
>>>>
>>>> The full-text-searching problem is best illustrated by the
>>>> full-text-searching hosting state of affairs at Cloudant and CouchBase
>>>> - the only two commercial companies worth mentioning within the
>>>> CouchDb community. Neither one uses Lucene out of the box and only
>>>> Cloudant has their own solution. This means that I could not use a
>>>> redundancy-performance perfect Master-Master replication that is
>>>> hosted by both. This is why either full-text-searching needs to
>>>> become CouchDb's internal first citizen or our hosting friends need to
>>>> internalize and make Lucene their first class citizen.
>>>>
>>>> P.S. I love both but ...
>>>>