lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sven Duzont <>
Subject Re[2]: Index Partitioning ( was Re: Search deadlocking under load)
Date Mon, 11 Jul 2005 11:02:31 GMT

We are already using this design in production for a email job application system.
Each client (company) have an account and may have multiple users
When a new client is created, a new lucene index is automatically created when new job-applications
arrive for this account. 
Job applications are in principle owned by users, but some times they can share it with other
users in same account, so the search can be user-independent.
This design works fine for us as the flow of job applications is not the same for different
accounts. There are lucene indices that are more often updated than others.
It also permit us to rebuild one client index without impacting others

We have only one problem : when the index is updated and searched at the same time, the index
may be corrupted and an exception may be thrown by the indexer ("Read past OEF", i unfortunately
don't have the stack trace right now under my hand). I think that it is because the search
and indexation are made in two different java processes. We will rework the routines to lock
the search when an indexation is running and vice versa

--- sven

lundi 11 juillet 2005, 03:03:29, vous avez écrit:

PS> On 11/07/2005, at 10:43 AM, Chris Hostetter wrote:

>> : > Generally speaking, you only ever need one active Searcher, which
>> : > all of
>> : > your threads should be able to use.  (Of course, Nathan says that
>> : > in his
>> : > code base, doing this causes his JVM to freeze up, but I've  
>> never seen
>> : > this myself).
>> : >
>> : Thanks for your response Chris.  Do you think we are going down a
>> : deadly path by having "many smaller" IndexSearchers open rather than
>> : "one very large one"?
>> I'm sorry ... i think i may have confused you, i forgot  that this
>> thread
>> was regarding partioning the index.  i ment one searcher *per  
>> index* ...
>> don't try to make a seperate searcher per client, or have a pool of
>> searchers, or anything like that.  But if you have a need to partition
>> your data into multiple indexes, then have one searcher per index.

PS> Actually I think I confused you first, and then you confused me  
PS> back... Let me... uhh, clarify 'ourselves'.. :)

PS> My use of the word 'pool' was an error on my part (and a very silly
PS> one).  I should really have meant "LRU Cache".

PS> We have recognized that there is a finite # of IndexSearchers that
PS> can probably be open at one time.  So we'll use an LRU cache to make
PS> sure only the 'actively' in use Searchers are open.  However there
PS> will only be one IndexSearcher for a given physical Index directory
PS> open at a time, we're just making sure only the recently used ones
PS> are kept open to keep memory limits sane.

>> now assume you partition your data into two seperate indexes,  
>> unless the
>> way you partition your data lets you cleanly so that each of hte
>> two indexes contains only half the number of terms as if you had  
>> one big
>> index, then sorting on a field in those two indexes will require  
>> more RAM
>> then sorting on the same data in asingle index.

PS> Our data is logically segmented into Projects.  Each Project can  
PS> contain Documents and Mail.  So we currently have 2 physical Indexes
PS> per Project.  90% of the time our users work within one project at a
PS> time, and only work in "document mode" or "mail mode".  Every now and
PS> then they may need to do a general search across all Entities and/or
PS> Projects they are involved in (accomplished with Mulitsearcher).
PS> Perhaps we should just put Documents and Mail all in one Index for a
PS> project (ie have 1 Index per project)??

PS> Part of the reason in to partition is to make the cost of rebuilding
PS> a given project cheaper.  Reduces the risk of an Uber-Index being
PS> corrupted and screwing all the users up.  We can order the reindexing
PS> of projects to make sure our more important customers get re-indexed
PS> first if there is a serious issue.

PS> I would have thought that partitioning indexes would have performance
PS> benefits too:  a lot less data to scan (most of the data is already
PS> relevant).

PS> Since this isn't in production yet, I'd rather be proven wrong now
PS> rather than later! :)

PS> Thanks for your input.

PS> Paul
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message