lucene-dev mailing list archives

From Johann Höchtl (JIRA) <j...@apache.org>
Subject [jira] [Created] (LUCENE-3022) DictionaryCompoundWordTokenFilter Flag onlyLongestMatch has no affect
Date Tue, 12 Apr 2011 09:16:05 GMT
DictionaryCompoundWordTokenFilter Flag onlyLongestMatch has no affect
---------------------------------------------------------------------

                 Key: LUCENE-3022
                 URL: https://issues.apache.org/jira/browse/LUCENE-3022
             Project: Lucene - Java
          Issue Type: Bug
          Components: contrib/analyzers
    Affects Versions: 3.1, 2.9.4
            Reporter: Johann Höchtl
            Priority: Minor


When using the DictionaryCompoundWordTokenFilter with a German dictionary, I saw strange
behaviour:
The German word "streifenbluse" (blouse with stripes) was decompounded to "streifen" (stripe) and "reifen" (tire),
which makes no sense at all.
I expected the flag onlyLongestMatch to fix this, because "streifen" is longer than "reifen",
but it had no effect.
Reviewing the source code, I found the problem: longestMatchToken is reset for every start
offset, so the longest match is kept per offset rather than for the whole word:
[code]
protected void decomposeInternal(final Token token) {
    // Only words longer than minWordSize get processed
    if (token.length() < this.minWordSize) {
      return;
    }
    
    char[] lowerCaseTermBuffer=makeLowerCaseCopy(token.buffer());
    
    for (int i=0;i<token.length()-this.minSubwordSize;++i) {
        // reset for every start offset i, so one "longest" token is kept per offset
        Token longestMatchToken=null;
        for (int j=this.minSubwordSize-1;j<this.maxSubwordSize;++j) {
            if(i+j>token.length()) {
                break;
            }
            if(dictionary.contains(lowerCaseTermBuffer, i, j)) {
                if (this.onlyLongestMatch) {
                   if (longestMatchToken!=null) {
                     if (longestMatchToken.length()<j) {
                       longestMatchToken=createToken(i,j,token);
                     }
                   } else {
                     longestMatchToken=createToken(i,j,token);
                   }
                } else {
                   tokens.add(createToken(i,j,token));
                }
            } 
        }
        if (this.onlyLongestMatch && longestMatchToken!=null) {
          tokens.add(longestMatchToken);
        }
    }
  }
[/code]

The method should be changed to:

[code]
protected void decomposeInternal(final Token token) {
    // Only words longer than minWordSize get processed
    if (token.termLength() < this.minWordSize) {
      return;
    }

    char[] lowerCaseTermBuffer=makeLowerCaseCopy(token.termBuffer());

    // declared once, outside both loops, so it spans all start offsets
    Token longestMatchToken=null;
    for (int i=0;i<token.termLength()-this.minSubwordSize;++i) {

        for (int j=this.minSubwordSize-1;j<this.maxSubwordSize;++j) {
            if(i+j>token.termLength()) {
                break;
            }
            if(dictionary.contains(lowerCaseTermBuffer, i, j)) {
                if (this.onlyLongestMatch) {
                   if (longestMatchToken!=null) {
                     if (longestMatchToken.termLength()<j) {
                       longestMatchToken=createToken(i,j,token);
                     }
                   } else {
                     longestMatchToken=createToken(i,j,token);
                   }
                } else {
                   tokens.add(createToken(i,j,token));
                }
            }
        }
    }
    if (this.onlyLongestMatch && longestMatchToken!=null) {
        tokens.add(longestMatchToken);
    }
  }
[/code]

With this change, only the longest token is actually indexed and the onlyLongestMatch flag does what its name suggests.
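
To make the difference between the two variants concrete, here is a minimal standalone sketch. It is not Lucene code: the class LongestMatchSketch, its two methods, and the three-word dictionary are hypothetical stand-ins, and the loop bounds are simplified compared to the filter (subword lengths run from minSub to maxSub inclusive). It only illustrates where the "longest match" candidate is declared.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the filter's matching loops; not Lucene code.
class LongestMatchSketch {

    // Reported behaviour: the candidate is reset for each start offset i,
    // so one "longest" subword is emitted per offset.
    static List<String> longestPerStart(String word, Set<String> dict, int minSub, int maxSub) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + minSub <= word.length(); i++) {
            String longest = null;
            for (int len = minSub; len <= maxSub && i + len <= word.length(); len++) {
                String candidate = word.substring(i, i + len);
                if (dict.contains(candidate)) {
                    longest = candidate; // len only grows, so a later hit is longer
                }
            }
            if (longest != null) {
                out.add(longest);
            }
        }
        return out;
    }

    // Proposed behaviour: a single candidate is tracked across all start offsets.
    static List<String> longestOverall(String word, Set<String> dict, int minSub, int maxSub) {
        String longest = null;
        for (int i = 0; i + minSub <= word.length(); i++) {
            for (int len = minSub; len <= maxSub && i + len <= word.length(); len++) {
                String candidate = word.substring(i, i + len);
                if (dict.contains(candidate)
                        && (longest == null || candidate.length() > longest.length())) {
                    longest = candidate;
                }
            }
        }
        List<String> out = new ArrayList<>();
        if (longest != null) {
            out.add(longest);
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed toy dictionary, chosen only to reproduce the reported example.
        Set<String> dict = new HashSet<>(Arrays.asList("streifen", "reifen", "bluse"));
        System.out.println(longestPerStart("streifenbluse", dict, 2, 15)); // [streifen, reifen, bluse]
        System.out.println(longestOverall("streifenbluse", dict, 2, 15));  // [streifen]
    }
}
```

In this sketch the per-offset variant still emits "reifen" because it is the longest match starting at offset 2. The single-candidate variant emits only "streifen", which also means "bluse" is dropped for this toy dictionary; whether that is acceptable for the real filter is a separate question.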

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

