lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: I just don't get wildcards at all.
Date Sat, 08 Apr 2006 11:04:59 GMT
Eric,

Wildcard queries are tricky business.  Plain WildcardQuery, without
any analysis tricks, is what you've got, but you may want to consider
injecting rotated tokens.  For example, the word cat would be indexed
as "cat$", "at$c", "t$ca", and "$cat" (all at the same position, i.e.
with a position increment of 0).  That's half the equation.  The other
half is to rewrite the queries so that a search for c*t becomes a
WildcardQuery (or, in this case, a PrefixQuery) for t$c*, which makes
the search space much smaller.
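
As a rough sketch of the rotation idea, here is some plain Java with no
Lucene classes involved.  The class and method names are made up for
illustration, and it assumes that '$' never occurs in a real term and
that the pattern contains exactly one '*'.  Wiring the rotations into an
analyzer and the rewritten prefix into a PrefixQuery is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene.
final class RotatedTokens {
    // End-of-term marker; assumed never to appear in indexed terms.
    static final char END = '$';

    // All rotations of term + END, e.g. "cat" -> cat$, at$c, t$ca, $cat.
    static List<String> rotations(String term) {
        String marked = term + END;
        List<String> out = new ArrayList<>();
        for (int i = 0; i < marked.length(); i++) {
            out.add(marked.substring(i) + marked.substring(0, i));
        }
        return out;
    }

    // Rewrites a pattern with exactly one '*' into the prefix to look up
    // among the rotated tokens, e.g. "c*t" -> "t$c" (searched as t$c*).
    static String toRotatedPrefix(String pattern) {
        int star = pattern.indexOf('*');
        if (star < 0 || star != pattern.lastIndexOf('*')) {
            throw new IllegalArgumentException("expected exactly one '*': " + pattern);
        }
        String head = pattern.substring(0, star);
        String tail = pattern.substring(star + 1);
        return tail + END + head;
    }

    public static void main(String[] args) {
        System.out.println(rotations("cat"));
        System.out.println(toRotatedPrefix("c*t") + "*");
    }
}
```

The trade-off is index size: a term of n characters produces n + 1
tokens, in exchange for turning any single-wildcard pattern into a
prefix lookup.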

CSRQ definitely isn't what you want for wildcard queries.  Another
alternative, if it's reasonable to extract the wildcarded clauses from
a query expression, is to create a custom Filter.  The Filter can
enumerate terms as efficiently as possible (like WildcardTermEnum does)
and light up only the documents that contain matching terms, which
would eliminate the TooManyClauses headache.
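
To make the Filter idea concrete without tying it to a particular Lucene
version, here is a sketch in plain Java.  It models the term dictionary
as a sorted map from term to document ids, which is an assumption made
for illustration; in a real Filter the terms and documents would come
from the IndexReader.  The class and method names are hypothetical.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.SortedMap;

// Hypothetical stand-in for a wildcard Filter; not a Lucene class.
final class WildcardBitsSketch {
    // Sets one bit per document containing a term that matches
    // prefix*suffix. 'postings' maps each term to the ids of the
    // documents that contain it.
    static BitSet matchingDocs(SortedMap<String, int[]> postings,
                               String prefix, String suffix, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        // Terms are sorted, so start at the prefix and stop as soon as
        // a term no longer begins with it.
        for (Map.Entry<String, int[]> e : postings.tailMap(prefix).entrySet()) {
            String term = e.getKey();
            if (!term.startsWith(prefix)) {
                break;
            }
            boolean longEnough = term.length() >= prefix.length() + suffix.length();
            if (longEnough && term.endsWith(suffix)) {
                for (int doc : e.getValue()) {
                    bits.set(doc);
                }
            }
        }
        return bits;
    }
}
```

Because the matching terms only set bits and are never expanded into
BooleanQuery clauses, the number of matching terms does not run into the
clause limit.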

I don't think there's anything pre-built that does what you're after
any better than the suggestions above.

	Erik


On Apr 7, 2006, at 10:06 AM, Erick Erickson wrote:

> OK, I know I'm asking you to write my code for me (or at least point
> me to an example), but I'm at my wits' end, so please rescue me....
>
> This is a reprise of TooManyClauses. We have a large amount of text,
> and a requirement to do a wildcard query. Of course, it's waaaay too
> big to use Wildcard or the other "expanding" queries. They frighten
> me anyway.....
>
> Y'all pointed me at the ConstantScoreRangeQuery (CSRQ), but actually
> using it is not making sense to me.
>
> I just don't get how, for instance, CSRQ helps me that much. Say I
> want to search for big*er. I can use a CSRQ to get all the docs that
> include this term, just by using biga and bigz as my min/max terms.
> But then I'm stuck. I could iterate through all the docs returned,
> but that seems inefficient. Not to mention that the HitCollector (?)
> class warns against this due to "an order of magnitude" decrease in
> response time.
>
> What I *want* is a way to, for each doc in the CSRQ, get to answer
> whether it's a match. Really, on the order of a callback with the
> value that worked for the CSRQ and the ability to return a yes/no or
> a ranking. Again, I can iterate all the docs matched, but this seems
> expensive.
>
> Using filters doesn't really seem to do the trick for me either. If I
> understand them properly, they allow me to set up a bitset for all the
> documents that should be searched. All 1,000,000 of them? Or am I
> thinking about this completely backwards? I have LIA, but I'm also
> wondering if there's something in 1.9 that I haven't found yet.
>
> Now, given how easy the rest of Lucene is to use, I assume that I'm
> approaching this poorly, but I sure am stumped.
>
> All that said, I'm quite Java-naive, so please bear with me if this
> question demonstrates my ignorance painfully.....
>
> Thanks
> Erick



