lucene-dev mailing list archives

From Erick Erickson <>
Subject Re: [jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml
Date Thu, 19 May 2011 01:50:05 GMT
+1. I've seen far too many implementations of Solr that blindly use
the example configurations and then wonder why the results are
surprising (WordDelimiterFilterFactory by itself has confused more
people than I can recollect).

Although, just to contradict myself, I guess if people don't really
look at the configs, they deserve the consequences...

And to contra-contradict myself, at least that would give us a clue on
the user list about where to look first!


2011/5/18 Jan Høydahl (JIRA) <>:
> Jan Høydahl commented on SOLR-2519:
> -----------------------------------
> Largely agree with @Hoss' suggestion. But I think it would be wise to
> emphasize that the example schema is just that - an *example* -
> encouraging people to create new fieldTypes instead of editing the
> example ones. It's not a problem for "int", "date" etc., but for text I
> always encourage our customers and students to stay away from the
> FieldTypes in the example and make their own versions instead.
> One way to further encourage this best practice is naming all text
> FieldTypes clearly as examples, e.g.
> {code}
> <fieldType name="text_example_en" ..>
> <fieldType name="text_example_generic" ..>
> {code}
> We must realize that a lot of non-American users out there are already
> customizing their schemas with the naming pattern "text_<lang>", which
> means you'll find "text_en", "text_it", "text_no" in a lot of
> installations. It would therefore be unwise to introduce new FieldTypes
> in version 3.2 that clash with those names out of the box; hence the
> _example in the type name.
> When upgrading, I always leave all the example field types intact and
> add my custom ones separately, clearly marked by comments for easy
> copy/paste. I believe this to be a fairly common (and desirable)
> practice, and it would produce no clashes with the names above.
> With this example naming practice, we can be pretty sure that if people
> talk about the fieldType "text_example_en" on the lists, they mean the
> default example type, but if they talk about "text_en", it's something
> they've customized themselves (even if only by renaming the example).
> People will also be more reluctant to start modifying something with
> "_example" in its name without also changing the name.
>> Improve the defaults for the "text" field type in default schema.xml
>> --------------------------------------------------------------------
>>                 Key: SOLR-2519
>>                 URL:
>>             Project: Solr
>>          Issue Type: Bug
>>            Reporter: Michael McCandless
>>            Assignee: Michael McCandless
>>             Fix For: 3.2, 4.0
>>         Attachments: SOLR-2519.patch
>> Spinoff from:
>> The text fieldType in schema.xml is unusable for non-whitespace
>> languages, because it has the dangerous auto-phrase feature (of
>> Lucene's QP -- see LUCENE-2458) enabled.
>> Lucene leaves this off by default, as does ElasticSearch
>> (http://
>> Furthermore, the "text" fieldType uses WhitespaceTokenizer when
>> StandardTokenizer is a better cross-language default.
>> Until we have language specific field types, I think we should fix
>> the "text" fieldType to work well for all languages, by:
>>   * Switching from WhitespaceTokenizer to StandardTokenizer
>>   * Turning off auto-phrase
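
For illustration only, the two changes proposed above might look roughly
like this in schema.xml. This is a sketch, not the attached
SOLR-2519.patch: the type name follows the "_example" naming suggested
earlier in the thread, the lowercase filter and positionIncrementGap
value are assumptions, and it presumes a Solr version in which the
fieldType element accepts the autoGeneratePhraseQueries attribute.

```xml
<!-- Hypothetical sketch; the name and filter chain are assumptions,
     not the contents of the SOLR-2519 patch. -->
<fieldType name="text_example_generic" class="solr.TextField"
           positionIncrementGap="100"
           autoGeneratePhraseQueries="false"> <!-- auto-phrase off -->
  <analyzer>
    <!-- StandardTokenizer in place of WhitespaceTokenizer -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

A custom type copied from this one would then be renamed (for example
to "text_en") and declared alongside it, leaving the example type intact.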
