lucene-java-user mailing list archives

From eks dev <eks...@yahoo.co.uk>
Subject Re: Lucene search performance on Sun UltraSparc T2 (T5120) servers
Date Wed, 18 Feb 2009 15:58:48 GMT
Have you tried the NGram SpellChecker plus query expansion? It is quite similar to your proposal:
the SpellChecker already has your priority queue.
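
To make the n-gram idea concrete, here is a minimal sketch in plain Java. It does not use the Lucene SpellChecker API; the class NGramCandidates and its methods are hypothetical names made up for illustration. It indexes each term by its character n-grams, collects the terms that share enough n-grams with the query, and ORs them together as an expanded query, so the full term list is never scanned.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of n-gram candidate lookup; not Lucene code. */
public class NGramCandidates {
    private final int n;
    // Inverted index: n-gram -> terms containing it.
    private final Map<String, Set<String>> gramToTerms = new HashMap<>();

    public NGramCandidates(Collection<String> terms, int n) {
        this.n = n;
        for (String term : terms) {
            for (String gram : grams(term)) {
                gramToTerms.computeIfAbsent(gram, g -> new HashSet<>()).add(term);
            }
        }
    }

    /** Distinct character n-grams of s. */
    private Set<String> grams(String s) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            out.add(s.substring(i, i + n));
        }
        return out;
    }

    /** Terms sharing at least minShared n-grams with the query, best first. */
    public List<String> candidates(String query, int minShared) {
        Map<String, Integer> shared = new HashMap<>();
        for (String gram : grams(query)) {
            for (String term : gramToTerms.getOrDefault(gram, Collections.<String>emptySet())) {
                shared.merge(term, 1, Integer::sum);
            }
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : shared.entrySet()) {
            if (e.getValue() >= minShared) {
                out.add(e.getKey());
            }
        }
        // Most shared n-grams first; ties broken alphabetically.
        out.sort((x, y) -> {
            int c = Integer.compare(shared.get(y), shared.get(x));
            return c != 0 ? c : x.compareTo(y);
        });
        return out;
    }

    /** Query expansion: OR the candidate terms together. */
    public static String expand(List<String> terms) {
        return String.join(" OR ", terms);
    }
}
```

In a real setup the candidates would presumably be re-ranked by edit distance before expansion; shared n-gram count is only a cheap first filter.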



----- Original Message ----
> From: mark harwood <markharw00d@yahoo.co.uk>
> To: java-user@lucene.apache.org
> Sent: Wednesday, 18 February, 2009 11:54:18
> Subject: Re: Lucene search performance on Sun UltraSparc T2 (T5120) servers
> 
> 
> I was having some thoughts recently about speeding up fuzzy search.
> 
> The current system does edit-distance on all terms A-Z, single threaded. Prefix 
> length can reduce the search space and there is a "minimum similarity" threshold 
> but that's roughly where we are. Multithreading this to make use of multiple 
> CPUs is one option to look at but I was mainly thinking about smarter ways to do 
> the fuzzy scan:
> 
> I had the notion that we could move to a solution where a priority queue keeps 
> the "best matches so far" and, as you progress through the termEnum, you could 
> bail out of edit distance calculations quickly using a rough (cheap) assessment 
> of whether the current term is likely to make the cut (i.e. beat the current lowest 
> score in the priority queue). It would make sense to populate the priority queue 
> ASAP with the terms that are most likely to be the best matches, and these will be 
> the ones that share a reasonable length prefix.
> As an example - searching for Obama~
> 
> 1) Create "best matches" priority queue
> 2) Scan all terms from oba to obz populating priority queue
> 3) Scan all terms from "a" to "oba" and "obz" to "z", exiting quickly if the 
> term fails to meet lowest score in the priority queue.
> 
> How we "exit quickly" and how we determine what prefix to use in 2) are to be 
> determined, but the principle seems reasonable.
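
The three steps above could be sketched roughly as follows, in plain Java rather than against Lucene's TermEnum. Everything here is an assumption for illustration: the class FuzzyTopK, the methods boundedDistance and topK, and the parameters prefixLen, k and maxEdits are hypothetical names, and the "exit quickly" test is one possible choice (length difference plus abandoning the Levenshtein table once a whole row exceeds the bound), not a settled design.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch of a top-k fuzzy scan with early exit; not Lucene code. */
public class FuzzyTopK {

    private static final class Match {
        final String term;
        final int distance;
        Match(String term, int distance) { this.term = term; this.distance = distance; }
    }

    /** Levenshtein distance, or -1 as soon as it is certain to exceed maxEdits. */
    public static int boundedDistance(String a, String b, int maxEdits) {
        // Cheap check: the length difference is a lower bound on the distance.
        if (Math.abs(a.length() - b.length()) > maxEdits) return -1;
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
                rowMin = Math.min(rowMin, curr[j]);
            }
            // Row minima never decrease, so the final value cannot get back under the bound.
            if (rowMin > maxEdits) return -1;
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()] <= maxEdits ? prev[b.length()] : -1;
    }

    /** Best k terms by edit distance, closest first. */
    public static List<String> topK(List<String> terms, String query,
                                    int prefixLen, int k, int maxEdits) {
        String prefix = query.substring(0, Math.min(prefixLen, query.length()));
        // Step 1: priority queue with the worst match so far on top.
        PriorityQueue<Match> best =
            new PriorityQueue<>((x, y) -> Integer.compare(y.distance, x.distance));
        // Step 2 (pass 0): terms sharing the prefix. Step 3 (pass 1): all other terms.
        for (int pass = 0; pass < 2; pass++) {
            for (String term : terms) {
                if (term.startsWith(prefix) != (pass == 0)) continue;
                // Once the queue is full, a term must strictly beat the worst entry.
                int bound = best.size() < k ? maxEdits : best.peek().distance - 1;
                int d = boundedDistance(query, term, bound);
                if (d < 0) continue;
                if (best.size() == k) best.poll();
                best.add(new Match(term, d));
            }
        }
        List<String> out = new ArrayList<>();
        while (!best.isEmpty()) out.add(best.poll().term);
        Collections.reverse(out); // the queue yields the worst match first
        return out;
    }
}
```

For example, FuzzyTopK.topK(terms, "obama", 3, 10, 2) would fill the queue from the "oba" terms first. A real implementation would seek the term enumeration to the prefix range instead of testing startsWith on every term.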
> 
> Thoughts?
> 
> 
> 
> 
> ----- Original Message ----
> From: Varun Dhussa 
> To: java-user@lucene.apache.org
> Sent: Wednesday, 18 February, 2009 10:36:07
> Subject: Lucene search performance on Sun UltraSparc T2 (T5120) servers
> 
> Hi,
> 
> I have had a bad experience when migrating my application from Intel Xeon based 
> servers to Sun UltraSparc T2 T5120 servers. Lucene fuzzy search just does not 
> perform. A search which took approximately 500 ms takes more than 6 seconds to 
> execute.
> 
> The index has about 100,000,000 records. So, I tried to split it into 10 indices 
> and used the ParallelSearcher on it, but still got similar results.
> 
> I am guessing that this is because the distance implementation used by Lucene 
> requires higher clock speed and can't be parallelized much.
> 
> Please advise.
> 
> -- Varun Dhussa
> Product Architect
> CE InfoSystems (P) Ltd
> http://www.mapmyindia.com
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 
> 



      


