lucene-solr-user mailing list archives

From Jonathan Rochkind <rochk...@jhu.edu>
Subject RE: bi-grams for common terms - any analyzers do that?
Date Sun, 26 Sep 2010 00:21:43 GMT
Huh, okay, I didn't know that #2 happened at all. Can you explain, or point me to documentation
that explains, when it happens?  I'm afraid I'm having trouble understanding << if the
analyzer returns more than one position back from a "queryparser token" (whitespace). >>

Not entirely sure what that means.  Can you give an example?

Although query parser pre-tokenization is a problem in many cases (for me too), I'm
not sure dismax could work without some pre-tokenization. Doesn't it need that so it
can combine the scores of the individual words by "maximum disjunction"? It has to know
what the individual terms are if it's going to dismax-combine them, no?

I'm not sure if "the queryparser forms a phrase query without explicit phrase quotes" is a
problem for me. I had no idea it happened until now, never noticed it, and still don't really
understand in what circumstances it happens.

Jonathan
________________________________________
From: Robert Muir [rcmuir@gmail.com]
Sent: Saturday, September 25, 2010 10:58 AM
To: solr-user@lucene.apache.org
Subject: Re: bi-grams for common terms - any analyzers do that?

On Sat, Sep 25, 2010 at 10:33 AM, Jonathan Rochkind <rochkind@jhu.edu> wrote:

> Wow, I never heard of autoGeneratePhraseQueries before. Is there any
> documentation of what it does?
>
> My initial reaction is being confused because this sounds kind of like the
> opposite of the original issue. The original issue is that the query parsers
> are splitting on whitespace _before_ they give tokens to the field
> analyzers.  The query parsers actually do this only with queries that are
> NOT explicit phrase queries.  I wouldn't call this behavior "automatically
> generating phrase queries" exactly, and wouldn't expect that turning off
> "automatic generating of phrase queries" would prevent the pre-tokenization
> by the query parser.  But... it does somehow?
>

this is in reference to Tom's comment on his "l'art" problem
(http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance).

so, there are two problems:
1. that the queryparser "pre-tokenizes" on whitespace at all.
2. that the queryparser forms a phrase query, if the analyzer returns more
than one position back from a "queryparser token" (whitespace).

turning off autoGeneratePhraseQueries only solves problem #2, which matters
because automatic phrase generation is not appropriate for many languages.
Before this option (e.g. Solr 1.4.x), you had to use the PositionFilter to
work around this problem. But PositionFilter simply "flattens/stacks" the
positions (makes it seem as if they are all synonyms). With PositionFilter
you couldn't have phrase queries at all... and you don't get a BooleanQuery
coordination factor.
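
(a rough illustration of the "flattening", not actual Solr or Lucene code:
the toy analyzer and all function names below are made up for the example,
and tokens are modeled as plain (term, position increment) pairs.)

```python
from typing import List, Tuple

Token = Tuple[str, int]  # (term, position increment)


def analyze(text: str) -> List[Token]:
    """Hypothetical analyzer: splits on apostrophes, one position per piece."""
    return [(part, 1) for part in text.lower().split("'") if part]


def position_filter(tokens: List[Token]) -> List[Token]:
    """Mimics the flattening: the first token keeps its increment,
    every later token gets increment 0 (stacked on the same position)."""
    return [(term, inc if i == 0 else 0) for i, (term, inc) in enumerate(tokens)]


def absolute_positions(tokens: List[Token]) -> List[Tuple[str, int]]:
    """Turn position increments into absolute positions, starting at 0."""
    pos = -1
    out = []
    for term, inc in tokens:
        pos += inc
        out.append((term, pos))
    return out


print(absolute_positions(analyze("l'art")))                   # [('l', 0), ('art', 1)]
print(absolute_positions(position_filter(analyze("l'art"))))  # [('l', 0), ('art', 0)]
```

once both terms sit on position 0 they look like synonyms of each other, so
there is nothing left to build a phrase from.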

with autoGeneratePhraseQueries=false, you won't get a phrase query unless it
was in double quotes... it's just that simple.
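
(a minimal sketch of that decision, assuming a toy analyzer that splits on
non-word characters. this is not the real query parser: parse(), analyze()
and the "term"/"phrase"/"bool" tuples are hypothetical stand-ins for the
query objects a real parser would build.)

```python
import re
from typing import List, Tuple


def analyze(chunk: str) -> List[str]:
    """Hypothetical analyzer: lowercases and splits on non-word characters."""
    return [t for t in re.split(r"[^\w]+", chunk.lower()) if t]


def parse(query: str, auto_generate_phrase_queries: bool) -> List[Tuple[str, object]]:
    """Pre-tokenize on whitespace (problem #1), then analyze each chunk."""
    clauses = []
    # A chunk is either an explicitly quoted phrase or a run of non-whitespace.
    for match in re.finditer(r'"([^"]*)"|(\S+)', query):
        quoted, bare = match.group(1), match.group(2)
        if quoted is not None:
            # Explicit double quotes always produce a phrase.
            terms = [t for chunk in quoted.split() for t in analyze(chunk)]
            if terms:
                clauses.append(("phrase", terms))
            continue
        terms = analyze(bare)
        if not terms:
            continue
        if len(terms) == 1:
            clauses.append(("term", terms[0]))
        elif auto_generate_phrase_queries:
            # Problem #2: one whitespace chunk, several positions -> phrase.
            clauses.append(("phrase", terms))
        else:
            # Flag off: the terms become separate boolean clauses instead.
            clauses.append(("bool", terms))
    return clauses


print(parse("l'art moderne", True))   # [('phrase', ['l', 'art']), ('term', 'moderne')]
print(parse("l'art moderne", False))  # [('bool', ['l', 'art']), ('term', 'moderne')]
print(parse('"l\'art"', False))       # [('phrase', ['l', 'art'])]
```

note that the whitespace split still happens either way, which is why the
flag does nothing for problem #1.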

fixing problem #1 altogether is the way to go, because then the
tokenization would be left to the analyzer completely, and you would have a
lot more flexibility: https://issues.apache.org/jira/browse/LUCENE-2605

--
Robert Muir
rcmuir@gmail.com
