lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Optimization
Date Tue, 11 Oct 2005 02:03:03 GMT
Tom,

Very cool!  Thanks for sharing your technique, which works well for  
prefixed and suffixed wildcard queries.  However, it doesn't address  
an * in the middle of a term, say W*D.  Obviously your usage doesn't  
require better performance for a wildcard in the middle, so you've  
done well - I just wanted to point out the one caveat for others.  A  
prefixed wildcard is the worst performer, though, so you've nipped  
the major one.

     Erik

On Oct 7, 2005, at 9:17 AM, Aigner, Thomas wrote:

> Thanks Erik, I tried the reverse index and it worked like a charm.
> While I was doing this, we figured out a way to handle "contains"
> searches and searches with a wildcard at the beginning.  I thought I
> would share it with the community (and realized it handles the
> reverse-index case as well).
>
> Word: ABCDEFG
>
> Tokens created:
>     <ABCDEFG
>     BCDEFG
>     CDEFG
>     DEFG
>     EFG
>     FG
>
> What I search for depends on the search string:
>     WORD*   I search for <WORD*
>     *WORD   I search for WORD*
>     *WORD*  I search for WORD*
>     WORD    I search for <WORD
>
> With this technique, search time decreased tremendously for
> "contains" searches and for searches with a wildcard at the beginning.
> The index has become 5X as large and takes longer to build, but I'm
> willing to sacrifice disk space and build time for this huge gain in
> speed.  Also, I have taken the wildcard query completely out of the
> program now, so everything uses my customized analyzer.
>
> Tom
>
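
A minimal sketch of the token scheme and query rewriting described above, in plain Python rather than as a Lucene analyzer. The function names, the `min_len` cutoff (inferred from the token list stopping at FG), and the sorted list standing in for the term dictionary are assumptions for illustration, not Tom's actual code.

```python
from bisect import bisect_left

START = "<"  # start-of-word marker, as used in the message


def suffix_tokens(word, min_len=2):
    """Marked full word plus every shorter suffix of at least min_len chars."""
    tokens = [START + word]
    for i in range(1, len(word) - min_len + 1):
        tokens.append(word[i:])
    return tokens


def rewrite_query(query):
    """Map a user query to (term, is_prefix) following the table above."""
    core = query.strip("*")
    if query.startswith("*"):
        return core, True              # *WORD and *WORD* -> WORD*
    if query.endswith("*"):
        return START + core, True      # WORD* -> <WORD*
    return START + core, False         # WORD  -> <WORD (exact term)


def lookup(sorted_terms, term, is_prefix):
    """Seek to the first candidate in sorted order, then read until no match."""
    i = bisect_left(sorted_terms, term)
    hits = []
    while i < len(sorted_terms):
        t = sorted_terms[i]
        matched = t.startswith(term) if is_prefix else t == term
        if not matched:
            break
        hits.append(t)
        i += 1
    return hits


terms = sorted(set(suffix_tokens("ABCDEFG")))
print(lookup(terms, *rewrite_query("*CDE*")))  # suffix token starting with CDE
print(lookup(terms, *rewrite_query("ABC*")))   # marked full-word token
```

Every rewritten query is either an exact term or a trailing-wildcard prefix, so the lookup never has to scan the whole term list. Following the table as written, *WORD and *WORD* rewrite to the same prefix query, and a contains query only finds a word that begins with WORD if the unmarked full word is indexed too, which the token list above does not show.
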
> -----Original Message-----
> From: Erik Hatcher [mailto:erik@ehatchersolutions.com]
> Sent: Wednesday, October 05, 2005 9:27 AM
> To: java-user@lucene.apache.org
> Subject: Re: Optimization
>
>
> On Oct 5, 2005, at 9:05 AM, Aigner, Thomas wrote:
>
>>     I have a question.  Are there any obvious things that can be done
>> to help speed up query lookups, especially wildcard searches (e.g.
>> *lamps)?
>>
>
> Obvious?  Sort of.  *lamps needs to scan through _every_ single term
> in the index (for the specified field only, of course) because terms
> are lexicographically ordered.
>
> If you reverse terms during analysis and lay them in the same
> position (increment 0) as the original token you'd end up with
> "spmal..." terms.  Now pre-process the query string and if there is a
> prefixed wildcard query, reverse it so that "*lamps" turns into
> "spmal*" and you will likely achieve a dramatic speed-up.
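
The reversal idea above can be sketched in a few lines of plain Python. This is not Lucene analyzer code; `index_terms` and `rewrite_leading_wildcard` are hypothetical names, and the sketch only shows which strings would be indexed and how the query string would be pre-processed.

```python
def index_terms(token):
    """Terms to index for one token: the original plus its reversal.

    In an analyzer both would be emitted at the same position.
    """
    return [token, token[::-1]]


def rewrite_leading_wildcard(query):
    """Turn '*lamps' into 'spmal*'; leave every other query unchanged."""
    if query.startswith("*") and "*" not in query[1:]:
        return query[1:][::-1] + "*"
    return query


print(index_terms("lamps"))
print(rewrite_leading_wildcard("*lamps"))
```

The rewritten query is an ordinary trailing-wildcard query, so it only has to visit the terms that share the reversed prefix. One thing the sketch does not handle: a reversed term can coincide with a genuine word, so a marker character or a separate field may be needed to keep the two apart.
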
>
> This is just one technique for dealing with prefixed wildcard
> queries.  There is more fun to be had with queries like *lamps*.  A
> technique I learned from the book Managing Gigabytes is to rotate
> terms through all their possible variations and index all of those,
> which also requires cleverness on the querying side of things.
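
A rough sketch of the rotation idea, assuming the usual form in which an end-of-term marker is appended before rotating (this kind of structure is commonly called a permuterm or rotated index). The `$` marker, the function names, and the query rewriting rules below are assumptions for illustration; the message itself does not spell them out.

```python
END = "$"  # assumed end-of-term marker; must not occur inside any term


def rotations(term):
    """All rotations of term + END; each one would be indexed."""
    s = term + END
    return [s[i:] + s[:i] for i in range(len(s))]


def rotate_query(query):
    """Rewrite a wildcard query to (key, is_prefix) over the rotated terms."""
    stars = query.count("*")
    if stars == 0:
        return query + END, False            # exact: lamps -> lamps$
    if stars == 1:
        head, tail = query.split("*")
        return tail + END + head, True       # X*Y -> Y$X*
    if stars == 2 and query.startswith("*") and query.endswith("*"):
        return query[1:-1], True             # *X* -> X*
    raise ValueError("unsupported wildcard pattern: " + query)


def matches(term, query):
    """True if any rotation of the term satisfies the rewritten query."""
    key, is_prefix = rotate_query(query)
    return any(
        r.startswith(key) if is_prefix else r == key
        for r in rotations(term)
    )


print(rotations("lamps"))
print(rotate_query("l*s"))
print(matches("lamps", "*amp*"))
```

The point of the rotation is that the wildcard always ends up at the end of the rewritten key, so every supported pattern, including one with the wildcard in the middle of the term, becomes a prefix lookup. The cost is one indexed term per character of every original term.
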
>
>      Erik
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>