Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Date: Thu, 9 May 2013 23:05:53 +0000 (UTC)
From: "Uwe Schindler (JIRA)" <jira@apache.org>
To: dev@lucene.apache.org
Message-ID: <JIRA.12504029.1302599735002.296312.1368140753424@arcas>
In-Reply-To: <JIRA.12504029.1302599735002@arcas>
References: <JIRA.12504029.1302599735002@arcas>
Subject: [jira] [Updated] (LUCENE-3022) DictionaryCompoundWordTokenFilter
 Flag onlyLongestMatch has no affect
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


     [ https://issues.apache.org/jira/browse/LUCENE-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler updated LUCENE-3022:
----------------------------------

    Fix Version/s:     (was: 4.3)
                   4.4
> DictionaryCompoundWordTokenFilter Flag onlyLongestMatch has no effect
> ---------------------------------------------------------------------
>
>                 Key: LUCENE-3022
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3022
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 2.9.4, 3.1
>            Reporter: Johann Höchtl
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 4.4
>
>         Attachments: LUCENE-3022.patch, LUCENE-3022.patch
>
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> When using the DictionaryCompoundWordTokenFilter with a German dictionary, I got strange behaviour:
> The German word "streifenbluse" (blouse with stripes) was decompounded to "streifen" (stripe) and "reifen" (tire), which makes no sense at all.
> I thought the onlyLongestMatch flag would fix this, because "streifen" is longer than "reifen", but it had no effect.
> So I reviewed the source code and found the problem:
> [code]
> protected void decomposeInternal(final Token token) {
>     // Only words longer than minWordSize get processed
>     if (token.length() < this.minWordSize) {
>       return;
>     }
>
>     char[] lowerCaseTermBuffer=makeLowerCaseCopy(token.buffer());
>
>     for (int i=0;i<token.length()-this.minSubwordSize;++i) {
>         Token longestMatchToken=null;
>         for (int j=this.minSubwordSize-1;j<this.maxSubwordSize;++j) {
>             if(i+j>token.length()) {
>                 break;
>             }
>             if(dictionary.contains(lowerCaseTermBuffer, i, j)) {
>                 if (this.onlyLongestMatch) {
>                    if (longestMatchToken!=null) {
>                      if (longestMatchToken.length()<j) {
>                        longestMatchToken=createToken(i,j,token);
>                      }
>                    } else {
>                      longestMatchToken=createToken(i,j,token);
>                    }
>                 } else {
>                    tokens.add(createToken(i,j,token));
>                 }
>             }
>         }
>         if (this.onlyLongestMatch && longestMatchToken!=null) {
>           tokens.add(longestMatchToken);
>         }
>     }
>   }
> [/code]
> should be changed to:
> [code]
> protected void decomposeInternal(final Token token) {
>     // Only words longer than minWordSize get processed
>     if (token.termLength() < this.minWordSize) {
>       return;
>     }
>     char[] lowerCaseTermBuffer=makeLowerCaseCopy(token.termBuffer());
>     // Declared outside the outer loop so the longest match is tracked across all start offsets
>     Token longestMatchToken=null;
>     for (int i=0;i<token.termLength()-this.minSubwordSize;++i) {
>         for (int j=this.minSubwordSize-1;j<this.maxSubwordSize;++j) {
>             if(i+j>token.termLength()) {
>                 break;
>             }
>             if(dictionary.contains(lowerCaseTermBuffer, i, j)) {
>                 if (this.onlyLongestMatch) {
>                    if (longestMatchToken!=null) {
>                      if (longestMatchToken.termLength()<j) {
>                        longestMatchToken=createToken(i,j,token);
>                      }
>                    } else {
>                      longestMatchToken=createToken(i,j,token);
>                    }
>                 } else {
>                    tokens.add(createToken(i,j,token));
>                 }
>             }
>         }
>     }
>     if (this.onlyLongestMatch && longestMatchToken!=null) {
>         tokens.add(longestMatchToken);
>     }
>   }
> [/code]
> That way, only the longest token is actually indexed and the onlyLongestMatch flag has an effect.
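
To make the difference between the two variants easier to see, here is a minimal standalone sketch. It is not the Lucene code: it works on plain strings with a java.util.Set as the dictionary, the class and method names (LongestMatchSketch, perOffsetLongest, overallLongest) are made up for illustration, and the loop bounds are simplified. It assumes a dictionary containing only "streifen" and "reifen".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class LongestMatchSketch {

    // Reported behaviour: the longest match is tracked per start offset i,
    // so a token can be emitted for every offset that has a dictionary hit.
    static List<String> perOffsetLongest(String word, Set<String> dict, int minSub, int maxSub) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + minSub <= word.length(); i++) {
            String longest = null;
            for (int len = minSub; len <= maxSub && i + len <= word.length(); len++) {
                String candidate = word.substring(i, i + len);
                if (dict.contains(candidate)) {
                    longest = candidate; // len only grows, so the last hit is the longest
                }
            }
            if (longest != null) {
                tokens.add(longest);
            }
        }
        return tokens;
    }

    // Proposed behaviour: a single longest match is tracked across all offsets.
    static List<String> overallLongest(String word, Set<String> dict, int minSub, int maxSub) {
        String longest = null;
        for (int i = 0; i + minSub <= word.length(); i++) {
            for (int len = minSub; len <= maxSub && i + len <= word.length(); len++) {
                String candidate = word.substring(i, i + len);
                if (dict.contains(candidate)
                        && (longest == null || candidate.length() > longest.length())) {
                    longest = candidate;
                }
            }
        }
        List<String> tokens = new ArrayList<>();
        if (longest != null) {
            tokens.add(longest);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical two-word dictionary, chosen to mirror the example in the report.
        Set<String> dict = Set.of("streifen", "reifen");
        System.out.println(perOffsetLongest("streifenbluse", dict, 2, 15)); // [streifen, reifen]
        System.out.println(overallLongest("streifenbluse", dict, 2, 15));   // [streifen]
    }
}
```

In this sketch the per-offset variant still emits "reifen" because it starts at a different offset than "streifen". The overall variant emits at most one subword per input token.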

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org