lucene-java-user mailing list archives

From Paul Smith <psm...@aconex.com>
Subject Re: Index Partitioning ( was Re: Search deadlocking under load)
Date Mon, 11 Jul 2005 01:03:29 GMT

On 11/07/2005, at 10:43 AM, Chris Hostetter wrote:

>
> : > Generally speaking, you only ever need one active Searcher, which
> : > all of your threads should be able to use.  (Of course, Nathan says
> : > that in his code base, doing this causes his JVM to freeze up, but
> : > I've never seen this myself).
> : >
> : Thanks for your response Chris.  Do you think we are going down a
> : deadly path by having "many smaller" IndexSearchers open rather than
> : "one very large one"?
>
> I'm sorry ... I think I may have confused you, I forgot that this
> thread was regarding partitioning the index.  I meant one searcher
> *per index* ... don't try to make a separate searcher per client, or
> have a pool of searchers, or anything like that.  But if you have a
> need to partition your data into multiple indexes, then have one
> searcher per index.

Actually I think I confused you first, and then you confused me  
back... Let me... uhh, clarify 'ourselves'.. :)

My use of the word 'pool' was an error on my part (and a very silly
one).  I should have said "LRU Cache".

We have recognized that only a finite number of IndexSearchers can
probably be open at one time.  So we'll use an LRU cache to make
sure only the Searchers 'actively' in use are open.  There will still
be only one IndexSearcher open for a given physical Index directory
at a time; we're just making sure only the recently used ones
are kept open to keep memory limits sane.
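
The LRU idea above can be sketched roughly as follows. This is only an illustration, not the code from our system: SearcherCache, maxOpen and opener are hypothetical names, AutoCloseable stands in for IndexSearcher so the sketch has no Lucene dependency, and it assumes a JDK with lambdas (Java 8 or later).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/**
 * LRU cache holding at most one open resource per key, e.g. one searcher
 * per index directory. All names here are hypothetical.
 */
public class SearcherCache<K, V extends AutoCloseable> {
    private final LinkedHashMap<K, V> open;
    private final Function<K, V> opener;

    public SearcherCache(final int maxOpen, Function<K, V> opener) {
        this.opener = opener;
        // accessOrder=true keeps entries ordered from least to most recently used
        this.open = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                if (size() <= maxOpen) {
                    return false;
                }
                // Over the limit: close the least recently used entry and drop it
                closeQuietly(eldest.getValue());
                return true;
            }
        };
    }

    /** Returns the single cached instance for this key, opening it on first use. */
    public synchronized V get(K key) {
        V value = open.get(key);
        if (value == null) {
            value = opener.apply(key);
            open.put(key, value);
        }
        return value;
    }

    public synchronized int size() {
        return open.size();
    }

    private static void closeQuietly(AutoCloseable c) {
        try {
            c.close();
        } catch (Exception e) {
            // Ignored in this sketch; real code would log the failure
        }
    }
}
```

One caveat with a sketch like this: an evicted entry is closed immediately, even if another thread is still searching with it, so real code would probably need reference counting or a delayed close.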

>
> now assume you partition your data into two separate indexes.
> unless the way you partition your data lets you split it cleanly,
> so that each of the two indexes contains only half the number of
> terms as if you had one big index, then sorting on a field in those
> two indexes will require more RAM than sorting on the same data in
> a single index.
>
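
To make the point above concrete, here is a small counting sketch. It assumes, as the quoted text suggests, that sort memory grows with the number of distinct terms each index holds for the sort field; TermOverlap and its methods are hypothetical names, and the field values are made up for illustration.

```java
import java.util.HashSet;
import java.util.List;

public class TermOverlap {
    /** Distinct values in one index: a rough proxy for the term table a sort must hold. */
    public static int uniqueTerms(List<String> fieldValues) {
        return new HashSet<String>(fieldValues).size();
    }

    /** Sum of distinct values over all partitions; a shared term is counted once per partition. */
    public static int uniqueTermsAcross(List<List<String>> partitions) {
        int total = 0;
        for (List<String> partition : partitions) {
            total += uniqueTerms(partition);
        }
        return total;
    }
}
```

For example, if a Documents index has the field values alice, bob, carol, alice and a Mail index has alice, bob, dave, a single combined index holds 4 distinct terms, while the two separate indexes hold 3 + 3 = 6 between them, because alice and bob are stored in both.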

Our data is logically segmented into Projects.  Each Project can
contain Documents and Mail.  So we currently have 2 physical Indexes
per Project.  90% of the time our users work within one project at a
time, and only work in "document mode" or "mail mode".  Every now and
then they may need to do a general search across all Entities and/or
Projects they are involved in (accomplished with MultiSearcher).
Perhaps we should just put Documents and Mail all in one Index for a
project (i.e. have 1 Index per project)?

Part of the reason to partition is to make the cost of rebuilding
a given project cheaper.  It also reduces the risk of an Uber-Index
being corrupted and screwing all the users up.  We can order the
reindexing of projects to make sure our more important customers get
re-indexed first if there is a serious issue.

I would have thought that partitioning indexes would have performance  
benefits too:  a lot less data to scan (most of the data is already  
relevant).

Since this isn't in production yet, I'd rather be proven wrong now
than later! :)

Thanks for your input.

Paul