lucene-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-2519) Improve the defaults for the "text" field type in default schema.xml
Date Mon, 16 May 2011 18:22:48 GMT

    [ https://issues.apache.org/jira/browse/SOLR-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034172#comment-13034172 ]

Hoss Man commented on SOLR-2519:
--------------------------------

I feel like we are conflating two issues here: the "default" behavior of TextField, and the
example configs.

i don't have any strong opinions about changing the default behavior of TextField when {{autoGeneratePhraseQueries}}
is not specified in the {{<fieldType/>}}, but if we do make such a change, it should
be contingent on the schema version property (which we should bump) so that people who upgrade
will get consistent behavior with their existing configs (TextField.init already has an example
of this for when we changed the default of {{omitNorms}}).
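
To make that concrete from the config side, here is a minimal schema.xml fragment, not a complete schema. It assumes the old default stays tied to the existing schema version; the version value "1.3" and the field type names {{text_implicit}} / {{text_explicit}} are illustrative assumptions, not a committed change.

```xml
<!-- Illustrative fragment only. "1.3" stands for a schema written before the
     change; the version number that would flip the default is an assumption. -->
<schema name="example" version="1.3">
  <types>
    <!-- attribute omitted: behavior comes from the version-dependent default,
         so an upgraded install keeps behaving as it did before -->
    <fieldType name="text_implicit" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>

    <!-- attribute set explicitly: behavior is the same under any schema version -->
    <fieldType name="text_explicit" class="solr.TextField" positionIncrementGap="100"
               autoGeneratePhraseQueries="true">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>
  </types>
</schema>
```

Only a schema that declares the bumped version and leaves the attribute out would pick up the new default.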

as far as the example configs: i agree with Yonik that changing "text" at this point might
be confusing ... i think the best way to iterate moving forward would probably be:

* rename {{<fieldType name="text"/>}} and {{<field name="text"/>}} to something
that makes their purpose more clear (text_en, or text_western, or text_european, or some other
more general descriptive word for the types of languages where it makes sense) and switch all
existing {{<field/>}} declarations that currently use field type "text" to use this
new name.

* add a new {{<fieldType name="text_general"/>}} which is designed (and documented) to
be a general purpose field type when the language is unknown (it may make sense to fix/repurpose
the existing {{<fieldType name="textgen"/>}} for this, since it already suggests that's
what it's for)

* Audit all {{<field/>}} declarations that use "text_en" (or whatever name was chosen
above) and the existing sample data for those fields to see if it makes more sense to change
them to "text_general". also change any where, based on usage, the choice shouldn't matter.

The end result being that we have no {{<fieldType/>}} named "text" in the example configs,
so people won't get it confused with previous versions, and we'll have a new {{<fieldType/>}}
that works as well as possible with all languages, which we use as much as possible with the
example data.
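
As a rough sketch of where those steps could land, the fragment below shows the two field types side by side. "text_en" is just one of the candidate names mentioned above, the analyzer chains are stripped-down placeholders rather than the real example config, and the two {{<field/>}} declarations are hypothetical; which type each field ends up with is exactly what the audit would decide.

```xml
<!-- Sketch only: names and analyzer chains are placeholders, not the
     committed example config. -->
<types>
  <!-- formerly "text": whitespace tokenization, auto-phrase left on -->
  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100"
             autoGeneratePhraseQueries="true">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <!-- general purpose type for unknown languages: StandardTokenizer, auto-phrase off -->
  <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100"
             autoGeneratePhraseQueries="false">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
</types>

<fields>
  <!-- hypothetical assignments; the audit of fields and sample data decides these -->
  <field name="features" type="text_en" indexed="true" stored="true" multiValued="true"/>
  <field name="name" type="text_general" indexed="true" stored="true"/>
</fields>
```

Setting {{autoGeneratePhraseQueries}} explicitly on both types keeps the example configs independent of whatever the TextField default turns out to be.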






> Improve the defaults for the "text" field type in default schema.xml
> --------------------------------------------------------------------
>
>                 Key: SOLR-2519
>                 URL: https://issues.apache.org/jira/browse/SOLR-2519
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.2, 4.0
>
>         Attachments: SOLR-2519.patch
>
>
> Spinoff from: http://lucene.markmail.org/thread/ww6mhfi3rfpngmc5
> The text fieldType in schema.xml is unusable for non-whitespace
> languages, because it has the dangerous auto-phrase feature (of
> Lucene's QP -- see LUCENE-2458) enabled.
> Lucene leaves this off by default, as does ElasticSearch
> (http://www.elasticsearch.org/).
> Furthermore, the "text" fieldType uses WhitespaceTokenizer when
> StandardTokenizer is a better cross-language default.
> Until we have language specific field types, I think we should fix
> the "text" fieldType to work well for all languages, by:
>   * Switching from WhitespaceTokenizer to StandardTokenizer
>   * Turning off auto-phrase

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

