lucene-dev mailing list archives

From Mark Lassau <mlas...@atlassian.com>
Subject Re: [jira] Commented: (LUCENE-1373) Most of the contributed Analyzers suffer from invalid recognition of acronyms.
Date Thu, 04 Sep 2008 02:17:29 GMT
Grant Ingersoll (JIRA) wrote:
> Of course, it's still a bit weird, b/c in your case the type value is going to be set
> to ACRONYM, when your example is clearly not one.  This suggests to me that the grammar needs
> to be revisited, but that can wait until 3.0 I believe.

Grant, I'm not sure what you mean by "b/c in your case the type value is 
going to be set to ACRONYM, when your example is clearly not one."
Once we set replaceInvalidAcronym=true, the type is set to HOST instead.

However, if you do revisit the grammar, I would like to join the 
discussion of how <HOST> tokens should behave.
For instance, if you have a document like "visit www.apache.org", you 
currently won't get a hit if you search for "apache", because the whole 
host name is indexed as a single token.
In an issue tracker like JIRA, we want to be able to search for 
"NullPointerException" and get a hit for the document "Application 
threw java.lang.NullPointerException".
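
One way to picture the behaviour being asked for is to index a <HOST> token both whole and as its dot-separated parts. The following is a minimal sketch in plain Java, not Lucene code: the class name HostTokenParts and the method expand are hypothetical, and a real implementation would more likely be a TokenFilter placed after StandardTokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: shows which terms a
// "split <HOST> tokens" policy could emit for a single token.
final class HostTokenParts {

    /** Returns the original token followed by its non-empty dot-separated parts. */
    static List<String> expand(String hostToken) {
        List<String> terms = new ArrayList<>();
        terms.add(hostToken);                    // keep the full token searchable
        if (hostToken.indexOf('.') >= 0) {
            for (String part : hostToken.split("\\.")) {
                if (!part.isEmpty()) {           // skip empty pieces, e.g. from ".."
                    terms.add(part);
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [www.apache.org, www, apache, org]
        System.out.println(expand("www.apache.org"));
        // prints [java.lang.NullPointerException, java, lang, NullPointerException]
        System.out.println(expand("java.lang.NullPointerException"));
    }
}
```

With terms like these in the index, a search for "apache" or "NullPointerException" would match, at the cost of extra terms per host-like token.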

Also note that the current implementation has problems if the document 
is missing whitespace where you would expect it.
E.g. "I like Apache.They rock" gets tokenized as follows:
I              <ALPHANUM>
like           <ALPHANUM>
Apache.They    <HOST>
rock           <ALPHANUM>
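
The reason "Apache.They" comes out as <HOST> can be illustrated with a rough approximation of the rule. This sketch assumes the <HOST> rule amounts to "alphanumeric runs joined by single dots"; the regular expression is a simplified stand-in for the real grammar, and the class HostRuleSketch and method typeOf are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical illustration, not Lucene code: classifies whitespace-separated
// tokens with a simplified stand-in for the <HOST> rule.
final class HostRuleSketch {

    // Assumption: two or more alphanumeric runs joined by single dots.
    private static final Pattern HOST_LIKE =
            Pattern.compile("[A-Za-z0-9]+(\\.[A-Za-z0-9]+)+");

    static String typeOf(String token) {
        return HOST_LIKE.matcher(token).matches() ? "<HOST>" : "<ALPHANUM>";
    }

    public static void main(String[] args) {
        // Prints each token with its type; "Apache.They" is reported as <HOST>.
        for (String token : "I like Apache.They rock".split("\\s+")) {
            System.out.println(token + "\t" + typeOf(token));
        }
    }
}
```

Under this approximation a missing space after a full stop is indistinguishable from a host name, which is why the sentence boundary is lost.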

I don't think there is a simple one-size-fits-all answer to how this 
should behave. It depends on the context of the app that is using Lucene.
The best answer may be to make some of the behaviour configurable, or 
to provide a suite of specific analyzers.

Mark.
>> Most of the contributed Analyzers suffer from invalid recognition of acronyms.
>> ------------------------------------------------------------------------------
>>
>>                 Key: LUCENE-1373
>>                 URL: https://issues.apache.org/jira/browse/LUCENE-1373
>>             Project: Lucene - Java
>>          Issue Type: Bug
>>          Components: Analysis, contrib/analyzers
>>    Affects Versions: 2.3.2
>>            Reporter: Mark Lassau
>>            Priority: Minor
>>
>> LUCENE-1068 describes a bug in StandardTokenizer whereby a string like "www.apache.org."
>> would be incorrectly tokenized as an acronym (note the dot at the end).
>> Unfortunately, keeping the "backward compatibility" of a bug turns out to harm us.
>> StandardTokenizer has a couple of ways to indicate "fix this bug", but unfortunately
>> the default behaviour is still to be buggy.
>> Most of the non-English analyzers provided in lucene-analyzers utilize the StandardTokenizer,
>> and in v2.3.2 not one of these provides a way to get the non-buggy behaviour :(
>> I refer to:
>> * BrazilianAnalyzer
>> * CzechAnalyzer
>> * DutchAnalyzer
>> * FrenchAnalyzer
>> * GermanAnalyzer
>> * GreekAnalyzer
>> * ThaiAnalyzer