lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1787) Standard Tokenizer doesn't recognise I.B.M as Acronym, it requires it ends with a dot i.e I.B.M.
Date Mon, 24 Aug 2009 10:17:59 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746802#action_12746802 ]

Michael McCandless commented on LUCENE-1787:
--------------------------------------------

The big challenge here is backward compatibility.  That is, if we make this fix (which is a
good fix!) and users then upgrade to 2.9, queries may suddenly stop hitting the right
documents, because those documents were indexed with the old StandardAnalyzer that has this
bug.  In other words, the bug is "cached" in their index.

This is why we added "matchVersion" to StandardAnalyzer, but unfortunately we don't yet have
a clean way to honor matchVersion when a fix requires changing the JFlex grammar.
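
To make the compatibility concern concrete, here is a minimal sketch in plain Java.  It is
a toy model, not Lucene code: the class MatchVersionSketch, the enum values OLD and FIXED,
and the crude dot-stripping rules are all assumptions made for illustration.  It only shows
why the version used at index time and the version used at query time have to agree; it
does not address the harder problem of selecting between two JFlex grammars.

```java
// Hypothetical toy model; not Lucene's StandardAnalyzer or its version handling.
class MatchVersionSketch {
    // Assumed names: OLD models the buggy behaviour, FIXED the proposed one.
    enum MatchVersion { OLD, FIXED }

    // Deliberately crude stand-in for acronym handling:
    //   OLD   strips dots only when the token ends with a dot ("I.B.M." -> "IBM").
    //   FIXED strips dots whenever the token contains one   ("I.B.M"  -> "IBM").
    static String analyze(String token, MatchVersion version) {
        boolean treatAsAcronym = (version == MatchVersion.FIXED)
                ? token.contains(".")
                : token.endsWith(".");
        return treatAsAcronym ? token.replace(".", "") : token;
    }

    // A query hits a document only if both sides analyze to the same term.
    static boolean hits(String indexedText, MatchVersion indexVersion,
                        String queryText, MatchVersion queryVersion) {
        return analyze(indexedText, indexVersion).equals(analyze(queryText, queryVersion));
    }

    public static void main(String[] args) {
        // Index built before the fix, queried after upgrading without a version pin.
        System.out.println(hits("I.B.M", MatchVersion.OLD, "I.B.M", MatchVersion.FIXED));
        // Same index, but the query side is pinned to the old behaviour.
        System.out.println(hits("I.B.M", MatchVersion.OLD, "I.B.M", MatchVersion.OLD));
    }
}
```

In this model the first call reports a miss and the second a hit: pinning the query side to
the behaviour the index was built with keeps old indexes searchable, while a full reindex
with FIXED on both sides would pick up the new behaviour.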

> Standard Tokenizer doesn't recognise I.B.M as Acronym, it requires it ends with a dot
> i.e I.B.M.
> ------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1787
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1787
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Analysis
>    Affects Versions: 2.9
>            Reporter: Paul taylor
>         Attachments: LUCENE-1787.patch
>
>
> Standard Tokenizer doesn't recognise I.B.M; it requires that it end with a dot, i.e. I.B.M.
> This is particularly problematic if I.B.M is added to the index: with the StandardAnalyzer
> it will get added as IBM, but a search for I.B.M will not match because I.B.M will be left
> as is. I would expect a match in this scenario.
> I think it could be fixed by modifying the grammar rule ACRONYM_DEP in
> StandardTokenizerImpl.jflex so that it also supports
> {ALPHANUM} ("." {ALPHANUM})+
> i.e. a dot is only required between each character. (I'm not familiar with JFlex syntax.)
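
The proposed rule can be tried out with ordinary Java regular expressions.  This is only a
sketch: java.util.regex stands in for JFlex, ALPHANUM is approximated as a run of ASCII
letters and digits, and the class and method names are hypothetical.  It is not the actual
StandardTokenizerImpl grammar.

```java
import java.util.regex.Pattern;

// Illustrative regex stand-ins for JFlex rules; not the actual Lucene grammar.
class AcronymPatternSketch {
    // Assumption: ALPHANUM approximated as one or more ASCII letters or digits.
    static final String ALPHANUM = "[A-Za-z0-9]+";

    // Trailing-dot shape described in the report: a dot after every part, e.g. "I.B.M."
    static final Pattern TRAILING_DOT =
            Pattern.compile(ALPHANUM + "\\.(?:" + ALPHANUM + "\\.)+");

    // Proposed shape from the report: {ALPHANUM} ("." {ALPHANUM})+
    static final Pattern DOT_BETWEEN =
            Pattern.compile(ALPHANUM + "(?:\\." + ALPHANUM + ")+");

    static boolean isTrailingDotForm(String s) {
        return TRAILING_DOT.matcher(s).matches();
    }

    static boolean isDotBetweenForm(String s) {
        return DOT_BETWEEN.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"I.B.M.", "I.B.M", "IBM", "lucene.apache.org"}) {
            System.out.println(s + "  trailing-dot=" + isTrailingDotForm(s)
                    + "  dot-between=" + isDotBetweenForm(s));
        }
    }
}
```

Two things show up when running this.  The dot-between shape does not match "I.B.M." on its
own, so it would have to be added alongside the trailing-dot shape, as the report suggests.
It also matches host-like strings such as "lucene.apache.org", so a real grammar change
would need to decide how such input should be classified.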

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

