lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Johann Höchtl (JIRA) <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3022) DictionaryCompoundWordTokenFilter Flag onlyLongestMatch has no affect
Date Wed, 20 Apr 2011 20:28:05 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13022351#comment-13022351
] 

Johann Höchtl commented on LUCENE-3022:
---------------------------------------

Let's stay with this example:
Dict: {"soft","so","ft"}
Word: "softball"

The first option onlyLongestMatch, which behaves in the way, that only the longest matching
dictionary entry should be returned. (Should this option be modified to keep all of the longest
matches? length(soft) == length(ball)?)
Output: "soft" if true; "so","ft","soft" if false

The second option should be keepRemain, which makes a term out of the remain after substracting
the longestMatch (makes only sense with onlyLongestMatch!?)
Output: "soft","ball" if keepRemain==onlyLongestMatch==true

With this second option you could keep the remains, which are not in your dictionary (reduces
the complexity of the required dictionary and can improve the compound-splitting-logic)


> DictionaryCompoundWordTokenFilter Flag onlyLongestMatch has no affect
> ---------------------------------------------------------------------
>
>                 Key: LUCENE-3022
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3022
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/analyzers
>    Affects Versions: 2.9.4, 3.1
>            Reporter: Johann Höchtl
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3022.patch, LUCENE-3022.patch
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> When using the DictionaryCompoundWordTokenFilter with a german dictionary, I got a strange
behaviour:
> The german word "streifenbluse" (blouse with stripes) was decompounded to "streifen"
(stripe),"reifen"(tire) which makes no sense at all.
> I thought the flag onlyLongestMatch would fix this, because "streifen" is longer than
"reifen", but it had no effect.
> So I reviewed the sourcecode and found the problem:
> [code]
> protected void decomposeInternal(final Token token) {
>     // Only words longer than minWordSize get processed
>     if (token.length() < this.minWordSize) {
>       return;
>     }
>     
>     char[] lowerCaseTermBuffer=makeLowerCaseCopy(token.buffer());
>     
>     for (int i=0;i<token.length()-this.minSubwordSize;++i) {
>         Token longestMatchToken=null;
>         for (int j=this.minSubwordSize-1;j<this.maxSubwordSize;++j) {
>             if(i+j>token.length()) {
>                 break;
>             }
>             if(dictionary.contains(lowerCaseTermBuffer, i, j)) {
>                 if (this.onlyLongestMatch) {
>                    if (longestMatchToken!=null) {
>                      if (longestMatchToken.length()<j) {
>                        longestMatchToken=createToken(i,j,token);
>                      }
>                    } else {
>                      longestMatchToken=createToken(i,j,token);
>                    }
>                 } else {
>                    tokens.add(createToken(i,j,token));
>                 }
>             } 
>         }
>         if (this.onlyLongestMatch && longestMatchToken!=null) {
>           tokens.add(longestMatchToken);
>         }
>     }
>   }
> [/code]
> should be changed to 
> [code]
> protected void decomposeInternal(final Token token) {
>     // Only words longer than minWordSize get processed
>     if (token.termLength() < this.minWordSize) {
>       return;
>     }
>     char[] lowerCaseTermBuffer=makeLowerCaseCopy(token.termBuffer());
>     Token longestMatchToken=null;
>     for (int i=0;i<token.termLength()-this.minSubwordSize;++i) {
>         for (int j=this.minSubwordSize-1;j<this.maxSubwordSize;++j) {
>             if(i+j>token.termLength()) {
>                 break;
>             }
>             if(dictionary.contains(lowerCaseTermBuffer, i, j)) {
>                 if (this.onlyLongestMatch) {
>                    if (longestMatchToken!=null) {
>                      if (longestMatchToken.termLength()<j) {
>                        longestMatchToken=createToken(i,j,token);
>                      }
>                    } else {
>                      longestMatchToken=createToken(i,j,token);
>                    }
>                 } else {
>                    tokens.add(createToken(i,j,token));
>                 }
>             }
>         }
>     }
>     if (this.onlyLongestMatch && longestMatchToken!=null) {
>         tokens.add(longestMatchToken);
>     }
>   }
> [/code]
> So, that only the longest token is really indexed and the onlyLongestMatch Flag makes
sense.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message