From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2094) Prepare CharArraySet for Unicode 4.0
Date Mon, 30 Nov 2009 23:06:20 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783924#action_12783924 ]

Robert Muir commented on LUCENE-2094:
-------------------------------------

bq. But if the PhraseQuery is generated with QueryParser also preserving holes, then it works
properly?

What is "properly"?

If I search in English for "book for sale", it will match "books for sale".
This is considered OK for English.

If I am using the Persian analyzer, such a thing will not work, because the plural form of book
(کتاب) is formed by adding an additional word afterwards (کتاب ها).

So the way plural forms get "stemmed" to their singular form in Persian is implemented with
stopwords (ها is in the list). I think this is a clean, simple approach, which is why I did
it this way.

For English, the plural is attached to the word with an s... should we bump the posinc gap after
stemmed words in English too?

So you see, I think it's dependent upon the language and how you want the application to work.
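
To make the point above concrete, here is a minimal sketch in plain Java. It does not use any Lucene classes: `analyze` and `phraseMatches` are hypothetical helpers that only imitate stopword removal with or without position holes and an exact (slop 0) phrase match. The trailing words "for sale" are English placeholders standing in for the rest of a Persian phrase, not real analyzer output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: list index = token position, null = hole left by a stopword.
final class PositionGapSketch {
    static final String KETAB = "\u06A9\u062A\u0627\u0628"; // "book", as in the comment above
    static final String HA = "\u0647\u0627";                // plural marker, treated as a stopword

    // Hypothetical analyzer: whitespace split plus stopword removal.
    // With keepHoles=true a removed stopword still consumes a position.
    static List<String> analyze(String text, Set<String> stopwords, boolean keepHoles) {
        List<String> positions = new ArrayList<>();
        for (String term : text.trim().split("\\s+")) {
            if (!stopwords.contains(term)) {
                positions.add(term);
            } else if (keepHoles) {
                positions.add(null);
            }
        }
        return positions;
    }

    // Exact phrase match: every non-hole query term must sit at the same
    // relative position in the document. A hole in the query matches anything.
    static boolean phraseMatches(List<String> doc, List<String> query) {
        for (int i = 0; i + query.size() <= doc.size(); i++) {
            boolean ok = true;
            for (int j = 0; j < query.size(); j++) {
                String q = query.get(j);
                if (q != null && !q.equals(doc.get(i + j))) {
                    ok = false;
                    break;
                }
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList(HA));
        String pluralDoc = KETAB + " " + HA + " for sale";
        String singularQuery = KETAB + " for sale";
        for (boolean keepHoles : new boolean[] {false, true}) {
            boolean hit = phraseMatches(
                analyze(pluralDoc, stop, keepHoles),
                analyze(singularQuery, stop, keepHoles));
            System.out.println("keepHoles=" + keepHoles + " match=" + hit);
        }
    }
}
```

In this toy model the singular query matches the plural document only when the stopword leaves no hole. That is the same trade-off as asking whether English stemming should bump the position increment.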


> Prepare CharArraySet for Unicode 4.0
> ------------------------------------
>
>                 Key: LUCENE-2094
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2094
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 3.0
>            Reporter: Simon Willnauer
>            Assignee: Uwe Schindler
>             Fix For: 3.1
>
>         Attachments: LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.patch,
>                      LUCENE-2094.patch, LUCENE-2094.patch, LUCENE-2094.txt, LUCENE-2094.txt, LUCENE-2094.txt
>
>
> CharArraySet lowercases its entries if created with the corresponding flag. As a result,
> a String / char[] containing Unicode 4 chars that are in the set cannot be retrieved in
> "ignorecase" mode.
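
The issue summary does not say why lookups fail. One plausible cause, assumed here and not confirmed by the text, is lowercasing one `char` at a time, which leaves surrogate pairs untouched. The sketch below uses only `java.lang.Character` and does not reproduce CharArraySet itself. The class and method names are hypothetical, and U+10400 / U+10428 (a Deseret capital letter and its lowercase form) serve as the sample supplementary character.

```java
// Sketch under the stated assumption; not the CharArraySet implementation.
final class CodePointLowercaseSketch {

    // Lowercases each UTF-16 code unit independently. Surrogates have no
    // case mapping, so supplementary characters come back unchanged.
    static String lowerPerChar(char[] in) {
        char[] out = new char[in.length];
        for (int i = 0; i < in.length; i++) {
            out[i] = Character.toLowerCase(in[i]);
        }
        return new String(out);
    }

    // Lowercases whole code points, stepping over surrogate pairs.
    static String lowerPerCodePoint(char[] in) {
        StringBuilder sb = new StringBuilder(in.length);
        for (int i = 0; i < in.length; ) {
            int cp = Character.codePointAt(in, i);
            sb.appendCodePoint(Character.toLowerCase(cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char[] upper = Character.toChars(0x10400);
        String lower = new String(Character.toChars(0x10428));
        System.out.println("per char matches lowercase form:       "
            + lowerPerChar(upper).equals(lower));
        System.out.println("per code point matches lowercase form: "
            + lowerPerCodePoint(upper).equals(lower));
    }
}
```

If a set stored its entries with one strategy and looked them up with the other, the uppercase supplementary form would not be found in case-insensitive mode.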
