lucene-dev mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true
Date Thu, 09 Aug 2012 10:49:42 GMT
The text_general field type is meant to be a good default for all languages.

If you want English-specific behavior, you should use one of the
English field types (text_en, text_en_splitting,
text_en_splitting_tight).  The comments in schema.xml explain this.

Ideally we would eventually have default field types for many
different languages, not just English ... some day.

I don't think we should turn on autoGeneratePhraseQueries=true for
text_general: it's catastrophic to non-whitespace languages.
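
To make the two behaviors under discussion concrete, here is a minimal
sketch in Python. It is not Solr's query parser: split_subterms and
build_query are hypothetical names, the regex is only a rough stand-in
for word-splitting analysis, and lowercasing, stop words, and
WordDelimiterFilter options are ignored. It only shows how the sub-terms
of one whitespace-delimited term are combined under each setting.

```python
import re


def split_subterms(term):
    # Rough stand-in for analysis: split on runs of non-alphanumeric
    # ASCII characters (an assumption, not Solr's actual tokenization).
    return [t for t in re.split(r"[^0-9A-Za-z]+", term) if t]


def build_query(term, auto_generate_phrase_queries):
    # Combine the sub-terms of a single whitespace-delimited term.
    subterms = split_subterms(term)
    if len(subterms) <= 1:
        # Nothing was split, so the flag makes no difference.
        return subterms[0] if subterms else ""
    if auto_generate_phrase_queries:
        # Sub-terms must match adjacent and in order: a phrase query.
        return '"' + " ".join(subterms) + '"'
    # Otherwise any one sub-term is enough to match.
    return " OR ".join(subterms)


if __name__ == "__main__":
    for term in ["CD-ROM", "1,000", "out-of-the-box", "3.6", "docid-001"]:
        print(term, "=>", build_query(term, False),
              "|", build_query(term, True))
```

Under these assumptions, text from a language written without spaces
between words reaches build_query as one long "term", so with the flag on
the whole input would become a single phrase.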

Mike McCandless

http://blog.mikemccandless.com

On Wed, Aug 8, 2012 at 8:13 PM, Jack Krupansky <jack@basetechnology.com> wrote:
> Digging through the Jira and revision history, I discovered that back at the
> end of May 2011, a change was made to Solr that fairly significantly
> degrades the OOTB behavior for Solr queries, namely for word-splitting of
> terms with embedded punctuation, so that they end up, by default, doing the
> OR of the sub-terms, rather than doing the obvious phrase query of the
> sub-terms.
>
> Just a couple of examples:
>
> CD-ROM => CD OR ROM rather than "CD ROM"
> 1,000 => 1 OR 000 rather than "1 000" (when using WordDelimiterFilter)
> out-of-the-box => out OR of OR the OR box rather than "out of the box"
> 3.6 => 3 OR 6 rather than "3 6" (when using WordDelimiterFilter)
> docid-001 => docid OR 001 rather than "docid 001"
>
> All of those queries will give surprising and unexpected results.
>
> Back to the history of the change, there was a lot of lively discussion on
> SOLR-2015 - add a config hook for autoGeneratePhraseQueries:
> https://issues.apache.org/jira/browse/SOLR-2015
>
> And the actual change to default to the behavior described above was
> SOLR-2519 - improve defaults for text_* field types:
> https://issues.apache.org/jira/browse/SOLR-2519
>
> I gather that the original motivation was for non-European languages, and
> that even some European languages might search better without auto-phrase
> generation, but the decision to default English terms to NOT automatically
> generate phrase queries and to generate OR queries instead is rather
> surprising and unexpected and outright undesirable, as my examples above
> show.
>
> I had been aware of the behavior for quite some time, but I had thought it
> was simply a lingering bug so I paid little attention to it, until I
> stumbled across this autoGeneratePhraseQueries "feature" while looking at
> the query parser code. I can understand the need to disable automatic phrase
> queries for SOME languages, but to disable it by default for English seems
> rather bizarre, as my simple use cases above show.
>
> I'll file this as a Jira, but I wanted to call wider attention to it in case
> others were as unaware as I was that what had seemed like buggy behavior was
> done intentionally.
>
> Unless there has been a change of heart since SOLR-2015/2519, I guess we are
> stuck with the default TextField behavior, but at least we could improve the
> example schema in several ways:
>
> 1. The English text field types should have autoGeneratePhraseQueries=true.
> 2. Add commentary about the impact of autoGeneratePhraseQueries=true/false -
> in terms of use case examples, as above. Specifically note the ones that
> will break if the feature is disabled.
>
> Another, more controversial change will be:
>
> 3. Change text_general to autoGeneratePhraseQueries=true so that English
> will be treated reasonably by default. I suspect that most European
> languages will be at least "okay". A comment will note that this field
> attribute should be removed or set to false for non-whitespace languages, or
> that an alternative field type should be used. I suspect that the first
> thing any non-whitespace language application will want to do is pick the
> text field type that has analysis that makes the most sense for them, so I
> see no need to mess up English for no good reason.
>
> Make no mistake, #3 is the primary and only real goal of this OOTB
> improvement. Maybe "text_general" could be kept as is for reference as the
> purported "general" text field type (except that it doesn't work well for
> English, as shown above), and maybe there should be a "text_default" that I
> would propose should be text_en with commentary to direct users to the other
> choices for language.
>
> I would note that text_ja already has autoGeneratePhraseQueries=false, so
> I'm not sure why the default in the TextField code had to be changed to
> false. Any languages for which automatic phrase query generation is
> problematic should be attributed similarly. But, now that it is wired into
> the schema defaults, we may be stuck with it.
>
> I was rather surprised that SOLR-2519 actually changed the default in
> TextField rather than simply set the attribute as appropriate for the
> various text field types.
>
> There are probably also a couple of places in the wikis where the surprising
> behavior should be noted.
>
> And, I would propose that the 4.0 CHANGES.TXT very clearly highlight the
> kinds of use cases that unsuspecting users may not realize were BROKEN by
> the commit of SOLR-2519 that is masked under the innocent phrasing of
> "improve defaults for text_* field types". How many users seriously
> understood that queries with embedded dashes and commas behave differently
> as a result of that change?
>
> I am contemplating whether to suggest that the WordDelimiterFilter should
> also be part of the default text field type. Right now, it is hidden off in
> text_en_splitting.
>
> I'll file the Jira tomorrow. Feel free to hold off comments until the Jira
> appears.
>
> -- Jack Krupansky
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
