lucene-dev mailing list archives

From "Mark Lassau (JIRA)" <j...@apache.org>
Subject [jira] Created: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.
Date Wed, 03 Sep 2008 00:57:44 GMT
Most of the contributed Analyzers suffer from invalid recognition of acronyms.
------------------------------------------------------------------------------

                 Key: LUCENE-1373
                 URL: https://issues.apache.org/jira/browse/LUCENE-1373
             Project: Lucene - Java
          Issue Type: Bug
          Components: Analysis, contrib/analyzers
    Affects Versions: 2.3.2
            Reporter: Mark Lassau
            Priority: Minor


LUCENE-1068 describes a bug in StandardTokenizer whereby a string like "www.apache.org." would
be incorrectly tokenized as an acronym (note the dot at the end).
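
To make the misclassification concrete, here is a toy model in plain Java. It does not use Lucene. The two regular expressions only approximate an over-broad acronym rule and a stricter one, and the names AcronymSketch and classify are hypothetical. The sketch also assumes that turning the flag on relabels the over-broad matches as host names.

```java
import java.util.regex.Pattern;

// Toy model only: these patterns approximate the idea, they are not Lucene's grammar.
final class AcronymSketch {
    // Over-broad rule: two or more alphanumeric runs, each followed by a dot.
    static final Pattern DOTTED_RUNS =
            Pattern.compile("\\p{Alnum}+\\.(\\p{Alnum}+\\.)+");
    // Stricter rule: two or more single letters, each followed by a dot (U.S.A.).
    static final Pattern STRICT_ACRONYM =
            Pattern.compile("\\p{Alpha}\\.(\\p{Alpha}\\.)+");

    // Returns a label for the whole input string.
    // replaceInvalidAcronym = false models the backward-compatible (buggy) default.
    static String classify(String text, boolean replaceInvalidAcronym) {
        if (STRICT_ACRONYM.matcher(text).matches()) {
            return "ACRONYM";
        }
        if (DOTTED_RUNS.matcher(text).matches()) {
            // Assumption: with the flag on, these matches are treated as host names.
            return replaceInvalidAcronym ? "HOST" : "ACRONYM";
        }
        return "OTHER";
    }
}
```

With the flag off, classify("www.apache.org.", false) labels the host name as an acronym; with the flag on it does not, while "U.S.A." is labelled an acronym either way.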

Unfortunately, preserving "backward compatibility" with a bug turns out to harm us.
StandardTokenizer has a couple of ways to request the fixed behaviour, but the default
is still the buggy behaviour.
Most of the non-English analyzers provided in lucene-analyzers use StandardTokenizer,
and in v2.3.2 not one of them provides a way to get the non-buggy behaviour :(

I refer to:
* BrazilianAnalyzer
* CzechAnalyzer
* DutchAnalyzer
* FrenchAnalyzer
* GermanAnalyzer
* GreekAnalyzer
* ThaiAnalyzer

I would be willing to contribute a patch to make these Analyzers work in the next point release.

I see two ways to do this:
1) Add a static method to StandardTokenizerImpl that sets the "default" value of the
replaceInvalidAcronym flag.
    You would call setDefaultForReplaceInvalidAcronym(true) once from your own code, and from
then on anything that uses the old constructor would get replaceInvalidAcronym=true.
2) Add the replaceInvalidAcronym flag to all of the above Analyzers.
    Some of these already have multiple constructors, so I would probably just add a setter/getter
to them.
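
The two options can be sketched side by side as follows. This is a minimal illustration, not a patch: SketchTokenizer and SketchAnalyzer are hypothetical stand-ins for the tokenizer and for any one of the analyzers listed above, and only the names setDefaultForReplaceInvalidAcronym and replaceInvalidAcronym come from the proposal itself.

```java
// Hypothetical stand-in for the tokenizer; not a Lucene class.
class SketchTokenizer {
    // Option 1: a process-wide default consulted by the old constructor.
    private static volatile boolean defaultReplaceInvalidAcronym = false;

    static void setDefaultForReplaceInvalidAcronym(boolean value) {
        defaultReplaceInvalidAcronym = value;
    }

    private final boolean replaceInvalidAcronym;

    // "Old" constructor: picks up whatever the static default is right now.
    SketchTokenizer() {
        this(defaultReplaceInvalidAcronym);
    }

    // Explicit constructor: the caller decides.
    SketchTokenizer(boolean replaceInvalidAcronym) {
        this.replaceInvalidAcronym = replaceInvalidAcronym;
    }

    boolean isReplaceInvalidAcronym() {
        return replaceInvalidAcronym;
    }
}

// Hypothetical stand-in for one of the contributed analyzers.
class SketchAnalyzer {
    // Option 2: a per-analyzer flag, exposed through a setter/getter.
    private boolean replaceInvalidAcronym = false;

    void setReplaceInvalidAcronym(boolean value) {
        this.replaceInvalidAcronym = value;
    }

    boolean isReplaceInvalidAcronym() {
        return replaceInvalidAcronym;
    }

    // The analyzer passes its own flag on, ignoring the static default.
    SketchTokenizer newTokenizer() {
        return new SketchTokenizer(replaceInvalidAcronym);
    }
}
```

The trade-off shows up in the sketch: option 1 is a single call, but it is global mutable state that changes behaviour for every caller in the JVM, whereas option 2 keeps the choice local to each analyzer instance at the cost of touching every analyzer class.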

The question is, which of the above would be preferred?
Personally, I think the first is the least work, and also the easiest to back
out when you move on to v3.x and the "deprecated" behaviour is removed.
However, doing 2) means the least disruption to core code.

Also, judging by the "Fix Version/s" field above, I am guessing that a v2.3.3 release is planned.
Should I therefore provide a patch for the 2.3 branch as well as for trunk, which will
end up as 2.4?


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

