lucene-java-user mailing list archives

From Steven Schlansker <ste...@likeness.com>
Subject Re: PrefixQuery with short prefix does not match documents
Date Wed, 29 May 2013 00:20:11 GMT
Hi Mike,

Thank you for the pointer, that is indeed the cause here.
The reason I added the rewrite was to preserve the boost of the field on matches.
Specifically, some results have a field boost of log(popularity) and others have a field boost
of 100 to float them to the top.

Without the rewriter, all matches get the same score, so the results are more or less arbitrary.
It seems that I cannot expect to get scored results for more than booleanMaxClauseCount on
a prefix query, at least from my reading of MultiTermQuery's nested classes.
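
To make the truncation concrete, here is a minimal plain-Java sketch of what a "keep only the top N terms" rewrite does to a short prefix. This is not Lucene code: the class name TopTermsSketch, the method expandPrefix, and the rule of keeping the first N matches in sorted order are all assumptions made for illustration, and the real selection logic in MultiTermQuery may well differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical stand-in for a capped prefix expansion; not the Lucene implementation.
class TopTermsSketch {

    // Collects at most maxTerms dictionary terms that start with the prefix.
    // Assumption: terms are kept in sorted order until the cap is hit.
    static List<String> expandPrefix(SortedSet<String> dictionary, String prefix, int maxTerms) {
        List<String> kept = new ArrayList<>();
        // tailSet(prefix) starts at the first term >= prefix
        for (String term : dictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // walked past the end of the prefix range
            }
            if (kept.size() == maxTerms) {
                break; // cap reached: every remaining matching term is dropped
            }
            kept.add(term);
        }
        return kept;
    }

    public static void main(String[] args) {
        SortedSet<String> dict = new TreeSet<>(
                Arrays.asList("gable", "game", "garden", "german", "giant"));
        System.out.println(expandPrefix(dict, "g", 3));  // [gable, game, garden]
        System.out.println(expandPrefix(dict, "ge", 3)); // [german]
    }
}
```

With a cap of 3, the prefix "g" never reaches "german", while "ge" matches few enough terms that it survives. That mirrors the g* versus ge* behaviour in my original report, just with a much smaller cap than 10000.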

Is there a better way to indicate "popularity" than field boosts, that might work with PrefixQuery?
 Or am I asking for the impossible here?



The EdgeNGramFilter looks very interesting, and I suspect it is basically exactly what I want.
But I am going to be expected to ship this to production soon.  How confident are you of
the quality of the current patch?  I am willing to deal with some level of pain (it's pretty
clear that I'd have to redo the way that it does indexing, for example, to read from my data
source instead of a file…) but I am going to look like a fool if it crashes all over the
place :-)
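
As I understand the suggestion, the idea is roughly the following. This is only a plain-Java sketch of the concept, not the EdgeNGramFilter API or the LUCENE-4845 patch: the names EdgeNGramSketch, edgeNGrams, and useExactTerm, and the gram sizes 1 to 3, are assumptions I made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "index short prefixes, then cut over to a prefix query".
class EdgeNGramSketch {

    // Returns the leading substrings of term with lengths minGram..maxGram,
    // clipped to the term's own length. These would be indexed as extra terms.
    static List<String> edgeNGrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int upper = Math.min(maxGram, term.length());
        for (int len = minGram; len <= upper; len++) {
            grams.add(term.substring(0, len));
        }
        return grams;
    }

    // Short user input can be answered by a single exact term lookup on the
    // indexed grams; longer input falls back to a real prefix query.
    static boolean useExactTerm(String userPrefix, int maxGram) {
        return userPrefix.length() <= maxGram;
    }

    public static void main(String[] args) {
        System.out.println(edgeNGrams("german", 1, 3)); // [g, ge, ger]
        System.out.println(useExactTerm("g", 3));       // true
        System.out.println(useExactTerm("germ", 3));    // false
    }
}
```

The point is that "g" becomes one ordinary term rather than an expansion over every term starting with g, so no clause limit applies to the short prefixes.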

Thanks,
Steven

On May 25, 2013, at 8:44 AM, Michael McCandless <lucene@mikemccandless.com> wrote:

> I suspect this is because you set TopTermsScoringBooleanQueryRewrite
> method on the PrefixQuery: this will keep "only" the top 10K terms, so
> if g* matches more than 10K terms, some terms are dropped.
> 
> You may want to index short prefixes into the index instead, e.g.
> using EdgeNGramFilter, and then cutover to PrefixQuery when the prefix
> is "long enough".
> 
> This is the approach I took with the index-based suggester on
> https://issues.apache.org/jira/browse/LUCENE-4845 ...
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> 
> On Fri, May 24, 2013 at 7:06 PM, Steven Schlansker <steven@likeness.com> wrote:
>> Hi everyone,
>> 
>> I am building an autocomplete index.  The index contains both the names and a small set of fixed types.
>> The intention is that type matches will always come first, followed by name matches.
>> 
>> I am using a PrefixQuery to do substring matching.  Confusingly, I am finding that very short prefix
>> matches sometimes will return no results when combined with an additional filter.
>> 
>> For example, I have a document "body:german type:TYPE".  The query "+(type:TYPE) +body:ge*" matches this document.
>> The query "+(type:TYPE) +body:g*" does not.  Double confusingly, it works fine in Luke -- just not when I build the query by hand.
>> 
>> Here is how I create the document:
>> 
>> Document doc = new Document();
>> doc.add(new Field("body", "German", TextField.TYPE_STORED));
>> doc.add(new Field("type", "TYPE", StringField.TYPE_STORED));
>> 
>> Here is how I build the query:
>> 
>> BooleanQuery allowedTypes = new BooleanQuery();
>> allowedTypes.add(new TermQuery(new Term("type", "TYPE")), Occur.SHOULD);
>> 
>> PrefixQuery prefixQuery = new PrefixQuery(new Term("body", "ge"));
>> prefixQuery.setRewriteMethod(new MultiTermQuery.TopTermsScoringBooleanQueryRewrite(10000));
>> 
>> BooleanQuery mainQuery = new BooleanQuery();
>> mainQuery.add(allowedTypes, Occur.MUST);
>> mainQuery.add(prefixQuery, Occur.MUST);
>> 
>> Am I missing something obvious?
>> 
>> Thanks,
>> Steven Schlansker
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>> 