lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Leangen <apa...@leangen.net>
Subject Re: How can I limit the number of hits in my query?
Date Fri, 18 May 2007 04:18:20 GMT

Thank you, Erick, this is very useful!

Have you ever taken a look at Google Suggest[1]? It's very fast, and the
results are impressive. I think your suggestion will go a long way to
fixing my problem, but there's probably still quite a gap between this
approach and the kind of results that Google Suggest provides.

I wonder how it could be possible to do the same with Lucene...

Anyway, thanks a lot for the help!


[1] http://www.google.com/webhp?complete=1&hl=en


On Tue, 2007-05-15 at 12:46 -0400, Erick Erickson wrote:
> OK, I'm going to go on the assumption that all you're interested in
> is auto completion, so don't try to generalized this to queries......
> 
> Don't use queries, PrefixQuery or otherwise. Use one of the 
> TermEnums, probably WildcardTermEnum. What that will do is 
> allow you to find all the terms, in lexical order, that match 
> your fragment without using queries at all. This has several 
> advantages.... 
> 
> 1> it's fast. It doesn't require you to do anything but march down
>      some index list.
> 2> it doesn't expand the terms prior to processing. No such thing
>     as "TooManyClauses". Perhaps OutOfMemory if you try to 
>     return 100,000,000 terms <G>....
> 3> you can stop whenever you've accumulated "enough" terms,
>      where "enough" is up to you.
> 
> NOTE: there's also a RegexTermEnum in the 
> contrib section (last I knew, it may be in core in 2.1) that 
> allows arbitrary regex enumerations, but it's significantly
> slower than WildcardTermEnum, which is hardly surprising
> since it has to do more work.......
> 
> It's reasonable to ask how much use auto-completion is when the
> poor user has, say, 10,000 terms to choose from, so I think it's
> entirely reasonable to get the first, say, 100 terms and quit. You 
> should be able to do something like this quite easily with the Enums.
> 
> I think your original solution of using queries would not be
> satisfactory for the user anyway, *assuming* that the user is
> as impatient as I am and wants some auto-complete options 
> RIGHT NOW <G>, even if you solved the TooManyClauses
> issue.
> 
> Along the same lines, another question is whether you should
> try to auto-complete when the user has typed less than, say,
> 3 characters, but that's your design decision..... 
> 
> Really, try the WildcardTermEnum. It's pretty neat.
> 
> Hope this helps
> Erick
> 
> On 5/14/07, David Leangen <apache@leangen.net> wrote:
>         
>         Thank you very much for this. Some more questions inline... 
>         
>         
>         >
>         >         - How can I limit the number of hits? I don't know
>         in
>         >            advance what the data will be, so it's not
>         feasible for
>         >            me to use RangeQuery.
>         >
>         >
>         > You can use a TopDocs or a HitCollector object which allows
>         you
>         > to process each object as it's hit. But I doubt you need to
>         do this.
>         
>         > No.  I expect you're using a wildcard, and wildcard handling
>         is 
>         > complicated.
>         
>         
>         Ok, you're right. It's not the limiting of the results that's
>         the
>         problem, it's the way the search is expanded.
>         
>         Since this is an autocomplete, when the user types, for
>         example "a" or a 
>         Japanese character "あ", I am using PrefixFilter for this, so
>         I guess
>         the search turns into "a*" and "あ*" respectively.
>         
>         In the archive, the related posts I read either refer to a
>         DateRange 
>         (where it is possible to search first by year, then month...
>         etc.), or
>         they suggest to increase the max count.
>         
>         Neither of these solutions work in my case... It's not a date,
>         and I
>         have no idea of the results in advance and it would not be
>         practical or 
>         elegant to speculate on the results (for example first try
>         aa*~ab* and
>         see what that gives, etc.).
>         
>         I can get access to the "weight" values of the terms (a data
>         field
>         determined by their frequency of use), so I'll try something
>         related to 
>         that. For people with more experience, would that be a good
>         path to
>         take?
>         
>         Otherwise, would a reasonable solution be to override or
>         re-implement
>         PrefixFilter?
>         
>         
>         Thank you so much!
>         David
>         
>         
>         
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message