lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml
Date Mon, 16 May 2011 19:06:47 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034203#comment-13034203 ]

Robert Muir commented on SOLR-2519:
-----------------------------------

As someone frustrated by this (but who would ultimately like to move past it and try to help
with Solr's internationalization), I just wanted to say +1 to Hoss Man's proposal.

My only suggestion on what he said is that I would greatly prefer text_en over text_western
or whatever, for these reasons:
1. The stemming, stopwords, and so on here are English.
2. For other Western languages, even if you swap these out for, say, French or Italian (which
is the seemingly obvious way to cut over), the whole WDF (WordDelimiterFilter) + auto-phrase
combination is still a huge trap (see
http://www.hathitrust.org/blogs/large-scale-search/tuning-search-performance for an example).
In this case ElisionFilter can be used to avoid it.
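
To illustrate point 2, here is a rough sketch of what a French field type without the WDF + auto-phrase trap might look like. It is not taken from the SOLR-2519 patch; the field type name text_fr and the file name contractions_fr.txt are assumptions made up for this example.

```xml
<!-- Hypothetical sketch only: "text_fr" and "contractions_fr.txt" are
     illustrative names, not part of the SOLR-2519 patch. -->
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  <analyzer>
    <!-- StandardTokenizer instead of WhitespaceTokenizer + WordDelimiterFilter -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercase first so the article list only needs lowercase entries -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Strips elided articles such as l' or d' from the front of a token;
         the articles file is assumed to list them one per line -->
    <filter class="solr.ElisionFilterFactory" articles="contractions_fr.txt"/>
  </analyzer>
</fieldType>
```

The idea is that a term like "l'avion" ends up as the single token "avion" rather than being split into two tokens that the query parser would then turn into a phrase query.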

> Improve the defaults for the "text" field type in default schema.xml
> --------------------------------------------------------------------
>
>                 Key: SOLR-2519
>                 URL: https://issues.apache.org/jira/browse/SOLR-2519
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.2, 4.0
>
>         Attachments: SOLR-2519.patch
>
>
> Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5
> The text fieldType in schema.xml is unusable for non-whitespace
> languages, because it has the dangerous auto-phrase feature (of
> Lucene's QP -- see LUCENE-2458) enabled.
> Lucene leaves this off by default, as does ElasticSearch
> (http://www.elasticsearch.org/).
> Furthermore, the "text" fieldType uses WhitespaceTokenizer when
> StandardTokenizer is a better cross-language default.
> Until we have language specific field types, I think we should fix
> the "text" fieldType to work well for all languages, by:
>   * Switching from WhitespaceTokenizer to StandardTokenizer
>   * Turning off auto-phrase
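
For reference, the two changes listed in the issue description could look roughly like this in schema.xml. This is a minimal sketch under the assumption that auto-phrase is controlled through the autoGeneratePhraseQueries attribute on the field type; the actual filter chain in the attached patch may differ.

```xml
<!-- Minimal sketch, not the contents of SOLR-2519.patch -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="false"> <!-- auto-phrase turned off -->
  <analyzer>
    <!-- StandardTokenizer replaces WhitespaceTokenizer -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```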

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


