lucene-solr-dev mailing list archives

From "Michael Ludwig (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-1204) Enhance SpellingQueryConverter to handle UTF-8 instead of ASCII only
Date Fri, 05 Jun 2009 20:58:07 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716751#action_12716751 ]

Michael Ludwig commented on SOLR-1204:
--------------------------------------

Hi Robert, it looks like you're talking about this report:
http://unicode.org/reports/tr8/

The XML recommendation might also serve as a point of reference, although both NAME and
NMTOKEN are probably too liberal in what they accept (the colon, among others):
http://www.w3.org/TR/REC-xml/#NT-Name

Shalin mentioned Java identifiers, probably without intending to ban names such as "null"
and "class":
http://java.sun.com/docs/books/jls/third_edition/html/lexical.html#40625

I'm not Unicode-savvy enough to know what's behind all these classes, and I'm not Solr-savvy
enough to know which of the punctuation characters that have special meaning in the Solr query
language should or must be excluded. So first the spec, then the implementation (regular expression,
or some schema technology to validate the field names in schema.xml, or whatever).
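
As an illustration of the Java-identifier idea mentioned above, here is a minimal sketch that
checks a name character by character using the JDK's identifier character classes. The class
and method names (FieldNameCheck, isIdentifierLike) are hypothetical and not part of Solr.
Because only the characters are inspected, reserved words such as "null" and "class" pass.
Whether these character classes are the right spec for field names is exactly the open question.

```java
// Hypothetical helper for illustration only; not part of Solr.
class FieldNameCheck {

    // Returns true if every code point of the name is allowed in a Java
    // identifier. Reserved words are NOT rejected, since only characters
    // are checked.
    static boolean isIdentifierLike(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        int first = name.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        // Walk the rest of the string by code point, not by char,
        // so that supplementary characters are handled correctly.
        for (int i = Character.charCount(first); i < name.length(); ) {
            int cp = name.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        String[] samples = { "null", "class", "gr\u00f6\u00dfe", "a:b", "1abc", "my-field" };
        for (String s : samples) {
            System.out.println(s + " -> " + isIdentifierLike(s));
        }
    }
}
```

With this sketch, "null", "class" and "größe" are accepted, while "a:b", "1abc" and "my-field"
are rejected. Note that the JDK classes also admit "$" and "_", which a real spec might not want.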

> Enhance SpellingQueryConverter to handle UTF-8 instead of ASCII only
> --------------------------------------------------------------------
>
>                 Key: SOLR-1204
>                 URL: https://issues.apache.org/jira/browse/SOLR-1204
>             Project: Solr
>          Issue Type: Improvement
>          Components: spellchecker
>    Affects Versions: 1.3
>            Reporter: Michael Ludwig
>            Assignee: Shalin Shekhar Mangar
>            Priority: Trivial
>             Fix For: 1.4
>
>         Attachments: SpellingQueryConverter.java.diff, SpellingQueryConverter.java.diff
>
>
> Solr - User - SpellCheckComponent: queryAnalyzerFieldType
> http://www.nabble.com/SpellCheckComponent%3A-queryAnalyzerFieldType-td23870668.html
> In the above thread, it was suggested to extend the SpellingQueryConverter to cover the
> full UTF-8 range instead of handling US-ASCII only. This might be as simple as changing the
> regular expression used to tokenize the input string to accept a sequence of one or more Unicode
> letters ( \p{L}+ ) instead of a sequence of one or more word characters ( \w+ ).
> See http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html for Java regular
> expression reference.
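
The difference between the two patterns named in the description can be seen in a small
standalone sketch. It is not the SpellingQueryConverter code itself; the class name
TokenPatternDemo and the sample query are made up for illustration, and it assumes the default
Java behaviour in which \w covers only [a-zA-Z_0-9].

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not the actual SpellingQueryConverter.
class TokenPatternDemo {

    // Word characters: by default ASCII letters, digits and underscore only.
    static final Pattern ASCII_WORD = Pattern.compile("\\w+");

    // One or more Unicode letters (general category L), in any script.
    static final Pattern UNICODE_LETTERS = Pattern.compile("\\p{L}+");

    // Collects every non-overlapping match of the pattern in the input.
    static List<String> tokens(Pattern pattern, String input) {
        List<String> result = new ArrayList<>();
        Matcher m = pattern.matcher(input);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file pure ASCII:
        // u-umlaut is \u00fc, e-acute is \u00e9.
        String query = "gr\u00fcn caf\u00e9 foo_bar42";
        System.out.println(tokens(ASCII_WORD, query));
        System.out.println(tokens(UNICODE_LETTERS, query));
    }
}
```

For the sample query "grün café foo_bar42", \w+ yields the tokens gr, n, caf, foo_bar42, while
\p{L}+ yields grün, café, foo, bar. The second pattern keeps the accented words intact but also
drops digits and the underscore, so the change is not a pure superset of the old behaviour.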

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

