lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml
Date Fri, 27 May 2011 08:41:47 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040129#comment-13040129 ]

Robert Muir commented on SOLR-2519:
-----------------------------------

A few opinions:

1. First of all, I am +1 to the patch. I think it's an improvement overall; however, I think
it is worthwhile to discuss the issues below.

2. I think we need to stop kidding ourselves about example/default and just recognize that
99.99999999999% of users just use the example as their default configuration. Guys, the example
is the default; there is simply no argument, this is the reality! So I think we should present
reasonable field type names such as text_en etc. Please don't waste any more of our time trying
to convince users that the default is actually an example; it's a default.

3. The aggressive analysis is totally unnecessary and gives bad results; this is not 1985...
Let's drop the porter stemmer and the stopwords list and replace them with less aggressive
defaults such as the s-stemmer and a commongrams configuration.

4. I do not think the default query parser should be the lucene one. If we have a fancy one
(edismax?) that happily handles user input without exceptions... why not just default to the
best we have to offer?!
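
As a rough illustration of points 2 and 3, a less aggressive English field type could
look something like the fragment below. This is only a sketch and is not taken from the
patch: the name text_en comes from the suggestion above, while the choice of
EnglishMinimalStemFilterFactory as the s-stemmer, the CommonGrams filters reading their
word list from stopwords.txt, and the remaining attributes are all assumptions.

```xml
<!-- Hypothetical sketch only; not part of the SOLR-2519 patch. -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Keep common words, but also index them as bigrams with their neighbours -->
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <!-- Minimal plural stemming (s-stemmer) in place of the Porter stemmer -->
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Query-side counterpart of the common-grams filter -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
</fieldType>
```

The same stemmer sits on both the index and query side so that terms are reduced the
same way in both places.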


> Improve the defaults for the "text" field type in default schema.xml
> --------------------------------------------------------------------
>
>                 Key: SOLR-2519
>                 URL: https://issues.apache.org/jira/browse/SOLR-2519
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.2, 4.0
>
>         Attachments: SOLR-2519.patch, SOLR-2519.patch, SOLR-2519.patch
>
>
> Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5
> The text fieldType in schema.xml is unusable for non-whitespace
> languages, because it has the dangerous auto-phrase feature (of
> Lucene's QP -- see LUCENE-2458) enabled.
> Lucene leaves this off by default, as does ElasticSearch
> (http://www.elasticsearch.org/).
> Furthermore, the "text" fieldType uses WhitespaceTokenizer when
> StandardTokenizer is a better cross-language default.
> Until we have language-specific field types, I think we should fix
> the "text" fieldType to work well for all languages, by:
>   * Switching from WhitespaceTokenizer to StandardTokenizer
>   * Turning off auto-phrase
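
The two changes listed in the issue description might be sketched in schema.xml roughly
as follows. This is an assumed, minimal form and not the contents of the attached
patches: the autoGeneratePhraseQueries attribute on the field type is what controls
auto-phrase, and the lowercase filter and positionIncrementGap value are included only
to make the fragment complete.

```xml
<!-- Hypothetical sketch only; the actual SOLR-2519.patch may differ. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  <analyzer>
    <!-- StandardTokenizer in place of WhitespaceTokenizer -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With autoGeneratePhraseQueries="false", a single query term that the analyzer splits
into several tokens is no longer turned into a phrase query automatically.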

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


