Message-Id: <6046F6B0-F01F-4FA5-B414-E04847E3F76C@prima.de>
From: Jan Lehnardt <jan@prima.de>
To: couchdb-dev@incubator.apache.org
In-Reply-To: <200804151527.44308.sh@widetrail.dk>
Subject: Re: Lazy Fulltext Search
Date: Tue, 15 Apr 2008 15:38:07 +0200
References: <D27A5EDE-DEC9-457F-8C85-E6653921B699@apache.org>
 <4803F04D.4040209@naderman.de>
 <8D9B4B3B-0F5B-49F3-9853-2AC6D2E88030@apache.org>
 <200804151527.44308.sh@widetrail.dk>

Heya Søren,
On Apr 15, 2008, at 15:27, Soren Hilmer wrote:
> I guess what all this boils down to is that:
>
> When a database changes, you need to re-index all the views in the
> fulltextsearch design document.

If you take this route, yes.

> There is no way incremental changes can be made to the index, as one
> document change may potentially change more view results within the same
> view. Right?

Yup.

Eventually, I think, we will be able to have CouchDB calculate the
intersection of all FT hits and a view index for you. So the FT
indexer will only need to index the whole DB and CouchDB filters out
all matching documents that are not in the requested view for you. For
now, you've got to do it yourself.
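
A minimal sketch of the "do it yourself" step, assuming the FT engine returns a ranked list of document ids and each view row carries the emitting document's id under "id". The function name and the sample data are hypothetical, not part of CouchDB:

```python
from typing import Iterable, List


def filter_hits_by_view(ft_hits: Iterable[str], view_rows: Iterable[dict]) -> List[str]:
    """Keep only fulltext hits whose document id also appears in the view."""
    # Assumption: each view row is a dict with the source document id under "id".
    ids_in_view = {row["id"] for row in view_rows}
    # Preserve the fulltext engine's ranking order while dropping non-members.
    return [doc_id for doc_id in ft_hits if doc_id in ids_in_view]


# Hypothetical sample data: a ranked FT result and the rows of one view.
hits = ["doc-3", "doc-1", "doc-7"]
rows = [
    {"id": "doc-1", "key": "a", "value": None},
    {"id": "doc-3", "key": "b", "value": None},
]
print(filter_hits_by_view(hits, rows))
```

Building a set of ids first keeps the filter linear in the number of hits plus rows.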

Cheers
Jan
--


>
>
> --Søren
>
>
> On Tuesday 15 April 2008 14:05:38 Jan Lehnardt wrote:
>> On Apr 15, 2008, at 02:01, Nils Adermann wrote:
>>> Hi,
>>>
>>> I agree with Søren that this is not necessarily a good idea. It is
>>> not trivial for an indexer to figure out which view results changed.
>>> One method to do so is storing all indexed view results and then
>>> comparing them to the updated view once the indexer is called. This
>>> is a needless waste of resources. Updating the view index based on
>>> changed documents is even more difficult. You would have to
>>> recompute the view at least partially to find out which view results
>>> changed. Given the reduce step this means that any number of
>>> documents, including unchanged ones could be involved. This creates
>>> a lot of work.
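
For illustration, the snapshot comparison described above could look roughly like this. It is only a sketch: it assumes each snapshot is a plain mapping from row key to value, and the function name is made up for the example:

```python
from typing import Dict, List, Tuple


def diff_view_results(
    old: Dict[str, object], new: Dict[str, object]
) -> Tuple[List[str], List[str], List[str]]:
    """Return the (added, removed, changed) row keys between two snapshots."""
    # Assumption: a snapshot maps each view row key to its value.
    added = sorted(k for k in new if k not in old)
    removed = sorted(k for k in old if k not in new)
    changed = sorted(k for k in new if k in old and old[k] != new[k])
    return added, removed, changed


# Hypothetical snapshots taken before and after an update.
before = {"a": 1, "b": 2, "c": 3}
after = {"a": 1, "b": 5, "d": 4}
print(diff_view_results(before, after))
```

Note that the indexer has to keep a full copy of the previous results around, which is the resource cost mentioned above.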
>>
>> Yeah, but it doesn't actually matter who does the work :) So we'd rather
>> keep that out of CouchDB.
>>
>>> I think the problem we face here is different usage patterns of
>>> views. There are views which process a lot of data and which are
>>> based on documents that are updated frequently.  But they might only
>>> be read from infrequently. These views profit from JIT computation.
>>> However many applications use views which are infrequently updated
>>> but often queried or searched. Such views benefit from live
>>> updating. If an application allows searching data it nearly always
>>> means that the data will be read more frequently than it is updated.
>>> So in conclusion both methods (JIT and live updates) make sense for
>>> views. But search normally only needs the live update mechanism. I
>>> believe it should become configurable whether a view is updated
>>> immediately after a change or only after a query takes place.
>>> Fulltext search would always work on views with immediate updates.
>>> The indexer would be notified about the changed results. On views
>>> which delay updates, search would only work if the fulltext search
>>> provides a mechanism to compare the new view results to the old ones.
>>
>> Just query the view with ?count=0 to trigger an update after your
>> inserts and you have the synchronous update behaviour.
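
A hedged sketch of that trick in Python. The URL layout (/<db>/_view/<design>/<view>), the host and port, and the function names are assumptions for illustration; check them against the CouchDB version you run:

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def view_refresh_url(base: str, db: str, design: str, view: str) -> str:
    """Build a view URL that asks for zero rows."""
    # Assumption: views are served at /<db>/_view/<design>/<view>.
    path = "/".join(quote(part, safe="") for part in (db, "_view", design, view))
    return f"{base.rstrip('/')}/{path}?{urlencode({'count': 0})}"


def refresh_view(base: str, db: str, design: str, view: str) -> int:
    """Query the view so the server updates its index; return the HTTP status."""
    # count=0 keeps the response free of rows; the read itself triggers the update.
    with urlopen(view_refresh_url(base, db, design, view)) as resp:
        return resp.status


# Hypothetical server, database, design document and view names.
print(view_refresh_url("http://localhost:5984/", "mydb", "search", "all"))
```

Calling refresh_view() right after a batch of inserts gives the synchronous behaviour described above, at the cost of one extra request per batch.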
>>
>>> Cheers
>>> Nils
>>>
>>> Jan Lehnardt wrote:
>>>> On Apr 12, 2008, at 12:06, Søren Hilmer wrote:
>>>>> Hi
>>>>>
>>>>> Have you read Chris' response about letting the view engine call
>>>>> the indexer,
>>>>> as it has the information needed for the indexer? As I understand
>>>>> the idea,
>>>>> it will essentially keep the fulltext indexer and the views in sync.
>>>>>
>>>>> I like this idea and I believe the code for the indexer would be
>>>>> much simpler and more efficient.
>>>>>
>>>>> Also as the shift goes towards indexing views and not documents,
>>>>> it makes
>>>>> sense that it is the View engine that triggers the indexer, right?
>>>>
>>>> The only problem here is that views are updated when they are
>>>> queried, not when documents are added. So you could end up
>>>> with a lot of unindexed data because your view hasn't been
>>>> queried. That can be worked around, but I don't think it makes
>>>> things any easier :)
>>>>
>>>> The design of the update notification is intentionally simple. We
>>>> expect the clients (the Indexer in this case) to be smart. We
>>>> believe that this makes the server code more robust.
>>>>
>>>>> I have to study the View engine, if I am to provide any code for
>>>>> this, though
>>>>> (provided consensus blows in this direction).
>>>>>
>>>>> Have fun
>>>>> Søren
>>>>>
>>>>> On Friday 11 April 2008 13:26, Jan Lehnardt wrote:
>>>>>> On Apr 11, 2008, at 08:55, Søren Hilmer wrote:
>>>>>>> Hi Jan
>>>>>>>
>>>>>>> It certainly would simplify configuration, although the
>>>>>>> DbUpdateNotificationProcess setting ought to be retained as it is
>>>>>>> potentially useful for other stuff than indexing (can you have more
>>>>>>> than one of these set up?)
>>>>>>
>>>>>> No, the update searcher will stay! :-)
>>>>>>
>>>>>>> I am also worried about response times for searching; potentially the
>>>>>>> indexing can take considerable time. With the current approach indexing
>>>>>>> can be done off peak hours and only searching is done at prime time.
>>>>>>
>>>>>> Right, if you want to be conservative with resources, you might want
>>>>>> to go with my approach at the expense of possibly higher response times
>>>>>> the first time things are searched for (as it is with views). I just
>>>>>> wanted to make available my idea that fulltext indexing could be
>>>>>> modelled after how views work, in case this is useful for a specific
>>>>>> scenario.
>>>>>>
>>>>>> Cheers
>>>>>> Jan
>>>>>> --
>>>>>>
>>>>>>> Have fun
>>>>>>> Søren
>>>>>>> --
>>>>>>> Søren Hilmer, M.Sc., M.Crypt.
>>>>>>> wideTrail            Phone: +45 25481225
>>>>>>> Pilevænget 41        Email: sh@widetrail.dk
>>>>>>> DK-8961  Allingåbro  Web: www.widetrail.dk
>>>>>>>
>>>>>>> On Thu, April 10, 2008 23:32, Jan Lehnardt wrote:
>>>>>>>> Heya,
>>>>>>>> while thinking more about the fulltext implementation, I began to
>>>>>>>> wonder why we don't model it after the view engine.
>>>>>>>>
>>>>>>>> At the moment, we have an Indexer waiting for update notifications
>>>>>>>> and polling CouchDB for changes and a separate mechanism to
>>>>>>>> register a fulltext query Searcher, that looks up things in the index.
>>>>>>>>
>>>>>>>> My proposed architectural change would be to trigger the Indexer
>>>>>>>> from the Searcher module when a request comes in, just like views
>>>>>>>> work. This would delay the creation of fulltext indexes until they
>>>>>>>> are actually needed.
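
To make the idea concrete, here is a toy sketch of a Searcher that brings its index up to date only when a query arrives. Everything in it is hypothetical: the class name, the (sequence number, doc id, text) change shape, and the fetch_changes callback stand in for whatever the real Indexer would use to ask CouchDB for changes:

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical change record: (sequence number, document id, document text).
Change = Tuple[int, str, str]


class LazySearcher:
    """Toy searcher that indexes pending changes only when queried."""

    def __init__(self, fetch_changes: Callable[[int], List[Change]]) -> None:
        self._fetch_changes = fetch_changes  # returns changes newer than a sequence number
        self._index: Dict[str, Set[str]] = {}  # word -> ids of documents containing it
        self._last_seq = 0

    def _catch_up(self) -> None:
        for seq, doc_id, text in self._fetch_changes(self._last_seq):
            # Drop the document's old postings before re-adding its current words.
            for postings in self._index.values():
                postings.discard(doc_id)
            for word in text.lower().split():
                self._index.setdefault(word, set()).add(doc_id)
            self._last_seq = max(self._last_seq, seq)

    def search(self, word: str) -> List[str]:
        self._catch_up()  # the index is built or refreshed here, at query time
        return sorted(self._index.get(word.lower(), set()))


# Hypothetical change log standing in for the database.
log: List[Change] = [(1, "doc-1", "lazy fulltext search"), (2, "doc-2", "view engine")]
searcher = LazySearcher(lambda since: [c for c in log if c[0] > since])
print(searcher.search("fulltext"))
```

The first search pays for all pending indexing work, which is the response-time trade-off discussed in this thread.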
>>>>>>>>
>>>>>>>> The possible drawback, though, is that when building the fulltext
>>>>>>>> index is rather slow, old-style pre-calculation might be more
>>>>>>>> feasible. Views deal with that by requiring frequent requests
>>>>>>>> (possibly cron-ed).
>>>>>>>>
>>>>>>>> This is not a proposal or anything, just a thought I wanted to
>>>>>>>> share
>>>>>>>> with those who work on fulltext integration.
>>>>>>>>
>>>>>>>> If you have any input on this, please let us know ;)
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>> Jan
>>>>>>>> --
>>>>>
>>>>> --
>>>>> Søren Hilmer, M.Sc., M.Crypt.
>>>>> wideTrail            Phone:    +45 25481225
>>>>> Pilevænget 41        Email:    sh@widetrail.dk
>>>>> DK-8961  Allingåbro  Web:    www.widetrail.dk
>
>
>
> --
> Søren Hilmer, M.Sc., M.Crypt.
> wideTrail                       Phone:  +45 25481225
> Pilevænget 41                   Email:  sh@widetrail.dk
> DK-8961  Allingåbro             Web:    www.widetrail.dk
>