lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Friedland, Zachary (EDS - Strategy)" <zachary_friedl...@ml.com>
Subject RE: Ideal Index Fragmentation
Date Wed, 31 Aug 2005 01:53:55 GMT
Chris,

 

Thanks for your comments -- it's great to hear that people have had success with very large
indexes.

 

I'll be running on a 4-CPU (3.8GHz, 2GB RAM) Windows 2000 box, so hopefully I'll get some
advantages with the ParallelMultiSearcher... If anyone has some metrics to post on using the
ParallelMultiSearcher on a multi-CPU box, I'd really appreciate it.

 

More assorted questions:

*	To summarize from below: I've got lots of distinct objects, each with lots of different
fields (often with different boosts).  I have tried having the server be aware of all the
indexes and construct large OR queries, but that takes tons of memory to run these queries.
 As an alternative, I've concatenated all the core searchable fields into groups based on
their boosts.  So I end up with the same five fields in all indexes, and adjust the boost
accordingly (SearchAll130, SearchAll120, SearchAll100, SearchAll90, SearchAll75).  This works
pretty well in terms of results and is better with memory.  Has anybody else tried to do anything
like this - how did it work out?  Is there a better strategy to search lots of distinct fields
in lots of indexes?
*	I have been reading the posts on using Filter vs. BooleanQuery.  To implement a search-within-a-search,
it seems the Filter is advantageous due to its cacheability, but are there other pros or cons
that should be considered (memory, speed, etc).
*	I'm interested in implementing a "dynamic filter" component that will walk through the hits[]
object and pull out distinct values for certain fields to display as search-within-a-search
options (all of them will return at least one result since they are in the hits[]).  Has anyone
implemented something like this -- how did it work out?
*	When using a ParallelMultiSearcher, is there an easy way to know how many matches came from
each index searched?  I'd like to be able to display how many of each object are in the combined
hits[].  Since I'm storing one object type per index, the count of hits from each index will
give me that number.
*	Does anyone know when 1.9/2.0 will be released?

Sorry for the long post....

Thanks,

Zach

 

 

	-----Original Message----- 
	From: Chris Lamprecht [mailto:clamprecht@gmail.com] 
	Sent: Tue 8/30/2005 8:01 PM 
	To: java-user@lucene.apache.org 
	Cc: 
	Subject: Re: Ideal Index Fragmentation
	
	

	Zach,
	It probably won't help performance to split the index and then search
	it on the same machine unless you search the indexes in parallel (with
	a multiprocessor or multi-core machine).  Even in this case, the disk
	is often a bottleneck, essentially preventing the search from really
	running in parallel.  Although if your index is in the filesystem
	cache this may be fast.
	
	Notice I said "probably", "often", "may be" in the above paragraph --
	you just have to performance test it and measure it to see.  Start
	with a simple, one-index setup and see if that works.  A 2GB index
	isn't itself a problem (under linux at least).    Then your queries
	and any pre/post-query processing become the bigger factors.  I
	haven't run into any filesize limits under linux with indexes up to
	12GB (and others here with even larger indexes).  The one exception is
	using Lucene 1.9's MMapDirectory-- it's limited to (I believe) 4GB on
	a 32-bit platform.
	
	
	On 8/30/05, Friedland, Zachary (EDS - Strategy)
	<zachary_friedland@ml.com> wrote:
	> Does anyone have experience using lots of indexes simultaneously with
	> the multisearcher?  I'm looking to index 15 distinct objects for
	> searching, and was thinking of creating 15 distinct indexes for better
	> manageability & performance (for certain searches when I know which
	> index to search).
	>
	> Certain indexes will be very large (2-3 million documents), but most
	> will be 50,000-500,000 documents.  Each document will contain a good
	> number of fields (20-50).
	>
	> I know there are issues with file handles, but what happens to search
	> performance (parallel vs. sequential) over lots of indexes.  Am I better
	> off combining them into one huge index (2GB file limit may be an issue),
	> or some fragmentation in-between....
	>
	> Any advice that you have had with a similar project would be greatly
	> appreciated.
	>
	> Thanks,
	> Zach
	> --------------------------------------------------------
	>
	> If you are not an intended recipient of this e-mail, please notify the sender, delete
it and do not read, act upon, print, disclose, copy, retain or redistribute it. Click here
for important additional terms relating to this e-mail.     http://www.ml.com/email_terms/
	> --------------------------------------------------------
	>
	> ---------------------------------------------------------------------
	> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
	> For additional commands, e-mail: java-user-help@lucene.apache.org
	>
	>
	
	---------------------------------------------------------------------
	To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
	For additional commands, e-mail: java-user-help@lucene.apache.org
	
	

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message