lucene-dev mailing list archives

From "Mark Lassau (JIRA)" <>
Subject [jira] Commented: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.
Date Mon, 06 Jul 2009 04:45:14 GMT


Mark Lassau commented on LUCENE-1373:

This issue is about how Lucene parses ACRONYM tokens, which must contain a dot (e.g. "I.B.M."),
so your problem is certainly not exactly the same.
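
To make the distinction concrete, here is a minimal sketch in plain Java of the two token shapes under discussion. It does not use Lucene and is not StandardTokenizer's actual grammar: the class name, both regular expressions, and the sample string "host.example.com." are assumptions chosen for illustration only.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; this is NOT Lucene's tokenizer grammar.
final class AcronymSketch {
    // Assumed "strict" shape: single letters, each followed by a dot, e.g. "I.B.M."
    private static final Pattern STRICT = Pattern.compile("(?:\\p{L}\\.){2,}");

    // Assumed "loose" shape: runs of letters, each followed by a dot.
    // A rule this loose also accepts host-like strings that end in a dot.
    private static final Pattern LOOSE = Pattern.compile("(?:\\p{L}+\\.){2,}");

    static boolean isStrictAcronym(String s) {
        return STRICT.matcher(s).matches();
    }

    static boolean isLooseAcronym(String s) {
        return LOOSE.matcher(s).matches();
    }

    public static void main(String[] args) {
        // "host.example.com." is a made-up sample, not taken from the issue text.
        for (String s : new String[] {"I.B.M.", "IBM", "host.example.com."}) {
            System.out.println(s + " strict=" + isStrictAcronym(s)
                    + " loose=" + isLooseAcronym(s));
        }
    }
}
```

In this sketch a string without any dot, such as "IBM", matches neither pattern, which is why a problem involving dot-free tokens would fall outside the scope of this issue.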

Whether it is related to some other issue with Lucene analysers for different languages is
not clear.
It depends on the workings of your application, and I would suggest you contact the Alfresco
developers with this question.

> Most of the contributed Analyzers suffer from invalid recognition of acronyms.
> ------------------------------------------------------------------------------
>                 Key: LUCENE-1373
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis, contrib/analyzers
>    Affects Versions: 2.3.2
>            Reporter: Mark Lassau
>            Priority: Minor
>         Attachments: LUCENE-1373.patch
> LUCENE-1068 describes a bug in StandardTokenizer whereby a string like ""
> would be incorrectly tokenized as an acronym (note the dot at the end).
> Unfortunately, keeping the "backward compatibility" of a bug turns out to harm us.
> StandardTokenizer has a couple of ways to indicate "fix this bug", but unfortunately
> the default behaviour is still to be buggy.
> Most of the non-English analyzers provided in lucene-analyzers utilize the StandardTokenizer,
> and in v2.3.2 not one of these provides a way to get the non-buggy behaviour :(
> I refer to:
> * BrazilianAnalyzer
> * CzechAnalyzer
> * DutchAnalyzer
> * FrenchAnalyzer
> * GermanAnalyzer
> * GreekAnalyzer
> * ThaiAnalyzer

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
