lucene-java-user mailing list archives

From Dror Matalon <d...@zapatec.com>
Subject Re: Real time indexing and distribution to lucene on separate boxes (long)
Date Thu, 11 Mar 2004 19:28:44 GMT
A few questions to clarify how option 3 works:

You have dira where the search is done and dirb where the indexing is
done. dirb grows when you add new items to it, and at some point you
swap and dirb becomes dira, but what do you do then?

Also, how do you write from the indexer to the directory on the search box?

We had two different ideas on how to do this. We ended up implementing
option 2, but it only works if you have support for snapshots.

1. Similar idea to what you were suggesting, but optimizing the index:
	1. The searcher reads from dira.
	2. The writer writes to dirb which contains only *new* documents that
	aren't in dira.
	3. Periodically the writer stops writing and merges/optimizes dira and
	dirb to dirc.
	4. When done, the writer removes dirb and starts writing new
	documents to a fresh dirb.
	5. The searcher notices that dirc is ready. It renames dira to
	something else and renames dirc to dira.


The main problem with this approach is that the merging/optimizing can
be slow on large indexes. Even on fast machines with fast disks, merging
several gigabytes takes a while.
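
The rename in step 5 could be sketched roughly as below. This is only an
illustration, not the code we run: the class name IndexDirSwap, the method
promote, and the ".old.<timestamp>" naming are made up for the example, and
it assumes dira and dirc live on the same file system so that each move is
a plain rename.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for step 5 of option 1; names are illustrative only.
final class IndexDirSwap {

    /**
     * Parks the live index (dira) under a new name and gives the merged
     * index (dirc) the live name. Returns the path of the parked old index
     * so the caller can delete it once no searcher uses it any more.
     */
    static Path promote(Path dira, Path dirc) throws IOException {
        // Assumed naming scheme for the retired directory.
        Path retired = dira.resolveSibling(
                dira.getFileName() + ".old." + System.currentTimeMillis());

        // Move the current live index out of the way.
        Files.move(dira, retired, StandardCopyOption.ATOMIC_MOVE);

        // The merged index takes over the live name.
        Files.move(dirc, dira, StandardCopyOption.ATOMIC_MOVE);

        return retired;
    }
}
```

Note that each rename is atomic on its own, but the pair is not: for a
moment between the two moves there is no dira at all, so the searcher
should only open the directory after promote() has returned.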

2. The index is NFS mounted. The indexer keeps writing to the index and,
at defined times, creates an NFS snapshot of the index. It then creates
an entry in a db to let the searcher know that a new snapshot has been
created.
The searcher checks the db once a minute to see if there's a new
snapshot. If there is one, it opens the index in the new snapshot and
swaps it for the old one. The code to do this is synchronized.

The nice thing about this solution is that you have more than one copy
of the index without doing any copying. But you need to use NFS and
snapshots.
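
The searcher side of option 2 might look something like the sketch below.
It is a simplified illustration rather than our actual code: the class
SnapshotSwapper is made up, the Supplier stands in for the db lookup of the
newest snapshot id, and the Function stands in for opening a searcher on
that snapshot (in practice that would be a Lucene searcher opened on the
snapshot path).

```java
import java.util.Objects;
import java.util.function.Function;
import java.util.function.Supplier;

// Hypothetical sketch of the poll-and-swap logic; S is the searcher type.
final class SnapshotSwapper<S> {
    private final Supplier<String> latestSnapshot; // stand-in for the db query
    private final Function<String, S> opener;      // stand-in for opening the index
    private String currentId;
    private S current;

    SnapshotSwapper(Supplier<String> latestSnapshot, Function<String, S> opener) {
        this.latestSnapshot = Objects.requireNonNull(latestSnapshot);
        this.opener = Objects.requireNonNull(opener);
    }

    /**
     * Meant to be called periodically, e.g. once a minute. If the db names
     * a snapshot we are not serving yet, open it and swap it in. Returns
     * the searcher that was replaced so the caller can close it, or null
     * if nothing changed or there was no previous searcher.
     */
    synchronized S refresh() {
        String latest = latestSnapshot.get();
        if (latest == null || latest.equals(currentId)) {
            return null;
        }
        S fresh = opener.apply(latest); // open the new index before dropping the old
        S old = current;
        current = fresh;
        currentId = latest;
        return old;
    }

    /** The searcher that queries should use right now. */
    synchronized S current() {
        return current;
    }
}
```

One way to drive it would be a ScheduledExecutorService that calls
refresh() with a one-minute delay between runs. Because both methods are
synchronized, a query never sees a half-swapped state; the old searcher
should only be closed once in-flight queries against it have finished.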


Dror

On Thu, Mar 11, 2004 at 09:21:07AM -0800, Doug Cutting wrote:
> Kevin A. Burton wrote:
> > 3. Have two directories on the searcher.  The indexer would then sync to
> > a tmp directory and then at run time swap them via a rename once the
> > sync is over.  The downside here is that this will take up 2x disk space
> > on the searcher.  The upside is that the box will only slow down while
> > the rsync is happening.
> 
> For maximal search performance, this is your best bet.  Disk space is 
> cheap.  At some point all newly issued queries start going against the 
> new index, and, pretty soon, you can close and delete the old index. 
> But you never have to stop searching.  Disk i/o will spike a bit during 
> the changeover, as the new working set is swapped in.
> 
> Note that, if folks are paging through results by re-querying (the 
> standard method) and the index is updated then things can get funky. 
> One approach is to hang onto the old index longer, to make this less 
> likely.  In any case, you might want to add an index-id to the search 
> parameters, so that, if a next-page is issued when the index is no 
> longer there you can give some sort of "stale query" error.
> 
> > PS.. Random question.  The performance of the MultiSearcher is Mlog(N)
> > correct?  Where N is the number of documents in the index and M is the
> > number of indexes?  Is this about right?
> 
> No.
> 
> The added cost of a MultiSearcher is mostly proportional to just M, the 
> number of indexes.  The normal cost of searching is still mostly 
> proportional to N, the number of documents.  So M+N would probably be 
> more accurate.  There is a log(M) here and there, to, e.g., figure out 
> which index a doc id belongs to, but I doubt these are significant.
> 
> The significant costs of a MultiSearcher over an IndexSearcher are that 
> it adds more term dictionary reads (one per query term per index) and 
> more seeks (also one per query term per index, or two if you're using 
> phrases).
> 
> Doug
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 

-- 
Dror Matalon
Zapatec Inc 
1700 MLK Way
Berkeley, CA 94709
http://www.fastbuzz.com
http://www.zapatec.com


