lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-3723) Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true
Date Thu, 09 Aug 2012 17:06:19 GMT

    [ https://issues.apache.org/jira/browse/SOLR-3723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13431979#comment-13431979 ]

Michael McCandless commented on SOLR-3723:
------------------------------------------

I think apps that want this behaviour should simply use
text_en_splitting.  That's why we have that field type.

I don't think we should turn on auto-phrase for text_en (and
definitely not for text_general, breaking entire languages): there are
serious downsides (as Robert enumerated).

I was curious how ElasticSearch handles text by default, so I indexed
text 北京医科大学 and then searched for 北京大学 and it does match:
good (ie matching text_general).

I also indexed fly-fishing and confirmed fly, fishing and fly-fishing
all match (like text_general).

You can of course go and customize your analysis chain in ElasticSearch
(http://www.elasticsearch.org/guide/reference/index-modules/analysis ), and
set options like auto_generate_phrase_queries for the query parser:
http://www.elasticsearch.org/guide/reference/query-dsl/query-string-query.html 
if you want to get the same behavior as text_en_splitting.
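
For readers who want to see the trade-off concretely, here is a minimal
sketch in Python. It is not Lucene, Solr, or ElasticSearch code: the
tokenize() and matches() helpers are hypothetical, and the toy analyzer
only assumes that each CJK ideograph becomes its own token and that
other text is split on punctuation. It compares OR-of-sub-terms with an
auto-generated phrase query for a single query term.

```python
import re


def tokenize(text):
    """Toy analyzer (an assumption, not Lucene's StandardTokenizer):
    lowercase, emit each CJK ideograph as its own token, and split
    everything else on non-alphanumeric characters."""
    return re.findall(r"[\u4e00-\u9fff]|[a-z0-9]+", text.lower())


def matches(query_term, doc_text, auto_phrase):
    """Return True if the document matches the query term.

    auto_phrase=False: OR of the sub-terms (any sub-term may match).
    auto_phrase=True:  the sub-terms must appear adjacent and in order.
    """
    q = tokenize(query_term)
    d = tokenize(doc_text)
    if not q:
        return False
    if auto_phrase:
        n = len(q)
        return any(d[i:i + n] == q for i in range(len(d) - n + 1))
    return any(t in d for t in q)


if __name__ == "__main__":
    cases = [
        ("CD-ROM", "a music CD"),
        ("CD-ROM", "a CD ROM drive"),
        ("北京大学", "北京医科大学"),
    ]
    for query, doc in cases:
        for auto_phrase in (False, True):
            print(query, "|", doc, "| auto_phrase =", auto_phrase,
                  "->", matches(query, doc, auto_phrase))
```

Under these assumptions, the OR form matches a document that contains
only "CD", while the phrase form does not; and the phrase form stops
北京大学 from matching 北京医科大学, because the four characters are no
longer adjacent in the document.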
                
> Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true
> ----------------------------------------------------------------------------------------------
>
>                 Key: SOLR-3723
>                 URL: https://issues.apache.org/jira/browse/SOLR-3723
>             Project: Solr
>          Issue Type: Improvement
>          Components: Schema and Analysis
>    Affects Versions: 3.4, 3.5, 3.6, 4.0-ALPHA, 3.6.1
>            Reporter: Jack Krupansky
>
> Digging through the Jira and revision history, I discovered that back at the end of May
2011, a change was made to Solr that fairly significantly degrades the OOTB behavior for English
Solr queries, namely for word-splitting of terms with embedded punctuation, so that they end
up, by default, doing the OR of the sub-terms, rather than doing the obvious phrase query
of the sub-terms.
> Just a couple of examples:
> 1. CD-ROM => CD OR ROM rather than “CD ROM”
> 2. 1,000 => 1 OR 000 rather than “1 000” (when using the WordDelimiterFilter innocently
added to text_general or text_en)
> 3. out-of-the-box => out OR of OR the OR box rather than “out of the box”
> 4. 3.6 => 3 OR 6 rather than "3 6" (when using WordDelimiterFilter innocently added
to text_general or text_en)
> 5. docid-001 => docid OR 001 rather than "DOCID 001"
> All of those queries will give surprising and unexpected results.
> Note: The hyphen issue is present in StandardTokenizer, even if WDF is not used. Side
note: The full behavior of StandardTokenizer should be more fully documented on the Analyzers
wiki.
> Back to the history of the change, there was a lot of lively discussion on SOLR-2015
- add a config hook for autoGeneratePhraseQueries.
> And the actual change to default to the behavior described above was SOLR-2519 - improve
defaults for text_* field types.
> (Consider the entire discussion in those two issues incorporated here for reference.
Anyone wishing to participate in discussion on this issue would be well-advised to study those
two issues first.)
> I gather that the original motivation was for non-European languages, and that even some
European languages might search better without auto-phrase generation, but the decision to
default English terms to NOT automatically generate phrase queries and to generate OR queries
instead is rather surprising and unexpected and outright undesirable, as my examples above
show.
> I had been aware of the behavior for quite some time, but I had thought it was simply
a lingering bug so I paid little attention to it, until I stumbled across this autoGeneratePhraseQueries
"feature" while looking at the query parser code. I can understand the need to disable automatic
phrase queries for SOME languages, but to disable it by default for English seems rather bizarre,
as my simple use cases above show.
> Even if no action is taken on this Jira, I feel that it is important that there be a
wider awareness of the significant and unexpected impact from SOLR-2519, and that what had
seemed like buggy behavior was done intentionally.
> Unless there has been a change of heart since SOLR-2015/2519, I guess we are stuck with
the default TextField behavior, but at least we could improve the example schema in several
ways:
> 1. The English text field types should have autoGeneratePhraseQueries=true. If a user
innocently adds a word delimiter to text_en, for example, they need to know that autoGeneratePhraseQueries=true
is needed. Better to preempt that confusion and put the attribute in now. In fact, hyphenated
terms fail as I have noted above, so the addition is needed even if a WDF is not added.
> 2. Add commentary about the impact of autoGeneratePhraseQueries=true/false - in terms
of use case examples, as above. Specifically note the ones that will break if the feature
is disabled.
> Another, more controversial change will be:
> 3. Change text_general to autoGeneratePhraseQueries=true so that English will be treated
reasonably by default. I suspect that most European languages will be at least "okay". A comment
will note that this field attribute should be removed or set to false for non-whitespace languages,
or that an alternative field type should be used. I suspect that the first thing any non-whitespace
language application will want to do is pick the text field type that has analysis that makes
the most sense for them, so I see no need to mess up English for no good reason.
> Make no mistake, #3 is the primary and only real goal of this OOTB 
> improvement. Maybe "text_general" could be kept as is for reference as the purported
"general" text field type (except that it doesn't work well for English, as shown above),
and maybe there should be a "text_default" that I would propose should be a literal copy of
text_en with commentary to direct users to the other choices for language.
> I would note that text_ja already has autoGeneratePhraseQueries=false, so I'm not sure
why the default in the TextField code had to be changed to false. Any languages for which
automatic phrase query generation is problematic should be attributed similarly. But, now
that it is wired into the schema defaults, we may be stuck with it.
> I was rather surprised that SOLR-2519 actually changed the default in TextField rather
than simply set the attribute as appropriate for the various text field types.
> There are probably also a couple of places in the wikis where the surprising behavior
should be noted. There is literally no wiki documentation for this important feature. There
are only two references to autoGeneratePhraseQueries, with no discussion of exactly what this
feature does or what the downside is if it is disabled.
> In the past, there was no need to document the treatment of embedded word delimiters
(well, okay, the poor handling for non-whitespace languages SHOULD have been documented),
but now there is no documentation of the degradation of what was a default and implicit feature
that a lot of people assume should be automatic.
> And, I would propose that the 4.0 CHANGES.TXT very clearly highlight the kinds of use
cases that unsuspecting users may not realize were BROKEN by the commit of SOLR-2519 that
is masked under the innocent phrasing of "improve defaults for text_* field types". How many
users seriously understood that a query with embedded dashes and commas behaves differently
as a result of that change?
> I am contemplating whether to suggest that the WordDelimiterFilter should also be part
of the default text field type. Right now, it is hidden off in text_en_splitting.
> I think stemming should also be part of the default English field type. The whole point
of the "example" schema is to show-off the best of Lucene/Solr.
> I'm not quite ready to propose that English be the default language supported by the
example schema, but I am 99.999% certain that we should focus it on European, Roman, Latin
languages. Non-European languages are indeed important, and should probably have their own
schema. text_general was a good idea, but in hindsight it appears to have not been such a
great idea in light of the word-splitting problems I have highlighted above.
> Maybe I would propose that text_general be left as is, but that we add text_default which
is a copy of text_en (which would have WDF and stemming added) and fields use text_default
as their type. That way, it would be clear what is going on and users could sensibly see what
needs to happen if they wish to switch default languages.
> After discussion settles, a revised final proposal will be composed. And some specific
and non-controversial issues may be split into separate Jira issues.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

