lucene-dev mailing list archives

From "Christoph Kaser (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-6586) There is a typo in GermanStemmer that can lead to wrong stemming
Date Fri, 26 Jun 2015 11:22:04 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-6586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602716#comment-14602716 ]

Christoph Kaser commented on LUCENE-6586:
-----------------------------------------

Hi Michael,

I tried to write a small test case and realized that there is no input that leads to a wrong
token.
substCount is only used to determine how long the original input was, because some suffixes
are only stripped if the token has a minimum length.

{code}
if ( ( buffer.length() + substCount > 5 ) &&
      buffer.substring( buffer.length() - 2, buffer.length() ).equals( "nd" ) )
    {
      buffer.delete( buffer.length() - 2, buffer.length() );
    }
{code}

However, every substitution leaves at least one character. For the bug to take effect, there
has to be a substitution before the one that sets substCount to 2 (instead of incrementing
it by 2).
So we have:
- 2 characters that were left by the (at least 2) substitutions
- the suffix "nd"
- substCount, which was set to 2
That sums up to 6, which is greater than 5.
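
The arithmetic above can be sketched as a small Java check. This is only a model of the
worst case described here, not code from GermanStemmer; the class and method names are
hypothetical.

```java
// Hypothetical model of the length check; not taken from GermanStemmer.
final class SubstCountBound {

    // The value that the condition "buffer.length() + substCount > 5" looks at.
    static int checkedLength(int bufferLength, int substCount) {
        return bufferLength + substCount;
    }

    // Smallest value the check can see once the typo has taken effect.
    static int worstCaseCheckedLength() {
        int charsLeftBySubstitutions = 2; // at least 2 substitutions, each leaving >= 1 character
        int suffixLength = "nd".length(); // the suffix that is about to be stripped
        int buggySubstCount = 2;          // set to 2 by "=+" instead of being incremented
        return checkedLength(charsLeftBySubstitutions + suffixLength, buggySubstCount);
    }

    public static void main(String[] args) {
        int worstCase = worstCaseCheckedLength();
        System.out.println(worstCase + " > 5 is " + (worstCase > 5));
    }
}
```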

The other conditions that check on substCount work the same, except they check for greater
than 4.

Therefore, there is no token that triggers any wrong behaviour.

Still, I think the typo should be fixed, because it might be copied to a place where it has
an effect.
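
To illustrate how the typo behaves when copied elsewhere, here is a minimal sketch of the
difference between the two operators. The class and method names are hypothetical and only
serve the example.

```java
// Hypothetical demo of "=+" versus "+="; not taken from GermanStemmer.
final class AssignPlusDemo {

    // "=+ 2" is parsed as "= (+2)": the previous value is discarded.
    static int withTypo(int substCount) {
        substCount =+ 2;
        return substCount;
    }

    // "+= 2" adds 2 to the previous value, which is what was intended.
    static int withFix(int substCount) {
        substCount += 2;
        return substCount;
    }

    public static void main(String[] args) {
        // Assume one earlier substitution has already been counted.
        System.out.println("typo: " + withTypo(1)); // previous count is lost
        System.out.println("fix:  " + withFix(1));  // previous count is kept
    }
}
```

Both forms compile without error, which is why the mistake is easy to overlook.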

> There is a typo in GermanStemmer that can lead to wrong stemming
> ----------------------------------------------------------------
>
>                 Key: LUCENE-6586
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6586
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 5.2.1
>            Reporter: Christoph Kaser
>            Priority: Minor
>
> There is a small typo in GermanStemmer that leads to a wrong calculation of the substCount
> in line 203:
> {code}substCount =+ 2;{code}
> should be
> {code}substCount += 2;{code}
> I created a Pull Request for this some time ago, but it was apparently overlooked:
> https://github.com/apache/lucene-solr/pull/141



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
