lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml
Date Fri, 27 May 2011 16:00:47 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040298#comment-13040298 ]

Michael McCandless commented on SOLR-2519:
------------------------------------------

bq. I think we need to stop kidding ourselves about example/default and just recognize that
99.99999999999% of users just use the example as their default configuration. Guys, the example
is the default, there is simply no argument, this is the reality! So I think we should present
reasonable field type names such as text_en etc. Please don't waste any more of our time trying
to convince users that the default is actually an example; it's a default.

OK I agree.  So I'll rename the fields back to text_XX (instead of text_example_XX).

bq. 3. The aggressive analysis is totally unnecessary and gives bad results, this is not 1985...
Let's drop the porter stemmer and the stopwords list and replace them with less aggressive
defaults such as s-stemmer and a commongrams configuration.

Sounds great!  Can you post the analyzer XML for this....?  Kinda out of my league at this
point :)
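
For readers following the thread, here is a rough sketch of what such an analyzer
could look like. It is not the XML that was later posted or committed to the issue;
it only assumes that "s-stemmer" maps to solr.EnglishMinimalStemFilterFactory and
"commongrams" to solr.CommonGramsFilterFactory / solr.CommonGramsQueryFilterFactory.
The field type name text_en and the word list stopwords.txt are illustrative choices.

```xml
<!-- Sketch only: a less aggressive English field type (names are illustrative) -->
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Keep common words, and also emit word pairs that contain them -->
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <!-- Minimal plural stemmer, used here in place of the Porter stemmer -->
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Query-side counterpart of the index-time common grams filter -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
</fieldType>
```

In this sketch the common grams filter runs before the stemmer so that the word list
is matched against unstemmed tokens.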

bq. 4. I do not think the default query parser should be the lucene one, if we have a fancy
one (edismax?) that happily handles user input without exceptions... why not just default
to the best we have to offer?!

+1
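
As an illustration of that suggestion, a request handler in solrconfig.xml can select
the parser through the defType parameter. The handler name "search" and the qf field
"text" below are assumptions for the sketch, not values taken from the patch.

```xml
<!-- Sketch only: make edismax the default query parser for a handler -->
<requestHandler name="search" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Field(s) to search when the query names none; "text" is an assumption -->
    <str name="qf">text</str>
  </lst>
</requestHandler>
```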

Robert maybe you can take the patch and iterate w/ these changes...?


> Improve the defaults for the "text" field type in default schema.xml
> --------------------------------------------------------------------
>
>                 Key: SOLR-2519
>                 URL: https://issues.apache.org/jira/browse/SOLR-2519
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.2, 4.0
>
>         Attachments: SOLR-2519.patch, SOLR-2519.patch, SOLR-2519.patch
>
>
> Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5
> The text fieldType in schema.xml is unusable for non-whitespace
> languages, because it has the dangerous auto-phrase feature (of
> Lucene's QP -- see LUCENE-2458) enabled.
> Lucene leaves this off by default, as does ElasticSearch
> (http://www.elasticsearch.org/).
> Furthermore, the "text" fieldType uses WhitespaceTokenizer when
> StandardTokenizer is a better cross-language default.
> Until we have language specific field types, I think we should fix
> the "text" fieldType to work well for all languages, by:
>   * Switching from WhitespaceTokenizer to StandardTokenizer
>   * Turning off auto-phrase
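
The two changes listed in the issue description could be expressed in schema.xml
roughly as follows. This is a minimal sketch rather than the contents of the attached
patch: it assumes auto-phrase is controlled by the autoGeneratePhraseQueries attribute
on the field type, and the lowercase filter is included only to make the example a
usable analyzer.

```xml
<!-- Sketch only: "text" with StandardTokenizer and auto-phrase turned off -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  <analyzer>
    <!-- StandardTokenizer instead of WhitespaceTokenizer -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```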

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

