lucene-dev mailing list archives

From Jan Høydahl (JIRA) <j...@apache.org>
Subject [jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml
Date Wed, 18 May 2011 22:22:47 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13035796#comment-13035796
] 

Jan Høydahl commented on SOLR-2519:
-----------------------------------

Largely agree with @Hoss' suggestion. But I think it would be wise to emphasize that the example
schema is just that - an *example* - encouraging people to create new fieldTypes instead of
editing the example ones. It's not a problem for "int", "date" etc, but for text I always
encourage our customers and students to stay away from the FieldTypes in the example and make
their own versions instead.

One way to further encourage this best practice is naming all text FieldTypes clearly as examples,
e.g. 

{code}
<fieldType name="text_example_en" ..>
<fieldType name="text_example_generic" ..>
{code}

We must realize that a lot of non-American users out there are already customizing their schemas
with the naming pattern "text_<lang>", which means you'll find "text_en", "text_it",
"text_no" in a lot of installations. Therefore it would be unwise to introduce new FieldTypes
which clash with those names out of the box in version 3.2; hence the suggestion to include
"_example" in the type name.

When upgrading, I always leave all the example field types intact, and add my custom ones
separately, clearly marked by comments for easy copy/paste. I believe this to be a fairly
common practice, and a desirable one, and it would give no clashes with the example names above.
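
As a rough sketch of that layout: the custom type name "text_no", its analyzer chain and the
comment markers below are only illustrative assumptions, not taken from any patch on this issue,
and the example type is abbreviated.

{code}
<!-- Example field types shipped with Solr: left untouched when upgrading -->
<fieldType name="text_example_en" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- CUSTOM field types (hypothetical): kept together for easy copy/paste -->
<fieldType name="text_no" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="Norwegian"/>
  </analyzer>
</fieldType>
{code}

Since the custom type has its own name, a new release can change or add "text_example_*"
types without touching it.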

With this example naming practice, we can be pretty sure that if people talk about the fieldType
"text_example_en" on the lists, they mean the default example type, but if they talk about
"text_en", it's something they've customized themselves (even if only by renaming the example).
There will be more mental resistance to modifying something with "_example" in its name
without also changing the name.

> Improve the defaults for the "text" field type in default schema.xml
> --------------------------------------------------------------------
>
>                 Key: SOLR-2519
>                 URL: https://issues.apache.org/jira/browse/SOLR-2519
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.2, 4.0
>
>         Attachments: SOLR-2519.patch
>
>
> Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5
> The text fieldType in schema.xml is unusable for non-whitespace
> languages, because it has the dangerous auto-phrase feature (of
> Lucene's QP -- see LUCENE-2458) enabled.
> Lucene leaves this off by default, as does ElasticSearch
> (http://www.elasticsearch.org/).
> Furthermore, the "text" fieldType uses WhitespaceTokenizer when
> StandardTokenizer is a better cross-language default.
> Until we have language specific field types, I think we should fix
> the "text" fieldType to work well for all languages, by:
>   * Switching from WhitespaceTokenizer to StandardTokenizer
>   * Turning off auto-phrase

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

