lucene-dev mailing list archives

From "Rupert Westenthaler (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (LUCENE-8183) HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled
Date Tue, 27 Feb 2018 15:28:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378763#comment-16378763 ]

Rupert Westenthaler edited comment on LUCENE-8183 at 2/27/18 3:27 PM:
----------------------------------------------------------------------

 Patch: [^LUCENE-8183_20180227_rwesten.diff] 

h3. New Parameters:

* {{noSubMatches}}: true/false
* {{noOverlappingMatches}}: true/false

Together with the existing {{onlyLongestMatch}}, these can be used to define which subwords
should be added as tokens. Functionality is as described above.

Typically users will only want to set one of the three attributes, as {{noOverlappingMatches}}
is the most restrictive and {{noSubMatches}} is more restrictive than {{onlyLongestMatch}}.
When a more restrictive option is enabled, the state of the less restrictive ones does not have any
effect.

Because of that, it would be an option to refactor this into a single attribute with different
settings, but this would require thinking about backward compatibility for configurations that
use {{onlyLongestMatch=true}} at the moment.
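
A minimal sketch of what such a single setting could look like, assuming the precedence stated above (the most restrictive enabled option wins). The names {{SubwordModeSketch}}, {{SubwordMode}} and {{resolve}} are hypothetical and not part of the patch or of Lucene; the mapping from the three existing booleans is one possible way to keep old configurations working.

```java
public class SubwordModeSketch {

    /** Hypothetical single setting that would replace the three boolean attributes. */
    enum SubwordMode { ALL, ONLY_LONGEST_MATCH, NO_SUB_MATCHES, NO_OVERLAPPING_MATCHES }

    /**
     * Maps the three boolean attributes to one mode.
     * The most restrictive enabled flag decides; less restrictive flags are ignored.
     */
    static SubwordMode resolve(boolean onlyLongestMatch,
                               boolean noSubMatches,
                               boolean noOverlappingMatches) {
        if (noOverlappingMatches) {
            return SubwordMode.NO_OVERLAPPING_MATCHES;
        }
        if (noSubMatches) {
            return SubwordMode.NO_SUB_MATCHES;
        }
        if (onlyLongestMatch) {
            return SubwordMode.ONLY_LONGEST_MATCH;
        }
        return SubwordMode.ALL;
    }

    public static void main(String[] args) {
        // An existing configuration that only sets onlyLongestMatch=true keeps its meaning.
        System.out.println(resolve(true, false, false));
        // With all three enabled, only the most restrictive one has an effect.
        System.out.println(resolve(true, true, true));
    }
}
```

With a mapping like this, a configuration that only sets {{onlyLongestMatch=true}} would resolve to the same behaviour as before.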

h3. Algorithm

If processing of subwords is deactivated (that is, any of {{onlyLongestMatch}}, {{noSubMatches}},
or {{noOverlappingMatches}} is active), the algorithm first checks whether the token is part of the
dictionary. If so, it returns immediately. This avoids adding tokens for subwords when the
token itself is in the dictionary (see {{#testNoSubAndTokenInDictionary}} for more info).

I changed the iteration direction of the inner {{for}} loop to start with the longest possible
subword as this simplified the code. 

_NOTE:_ this also changes the order of the tokens in the token stream, but as all tokens
are at the same position that should not make any difference. I did, however, have to modify some
existing tests, as those were sensitive to the ordering.
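
The control flow described in this section can be sketched roughly as follows. This is a simplified illustration, not the code from the patch: the class {{DecomposeSketch}} and its {{decompose}} method are hypothetical, the hyphenation points are passed in as a plain array (assumed to include 0 and the token length) instead of coming from a hyphenator, and the dictionary is a plain {{Set<String>}}. It also assumes that {{noSubMatches}} drops subwords enclosed by an already accepted subword and that {{noOverlappingMatches}} additionally drops subwords that start inside one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class DecomposeSketch {

    static List<String> decompose(String token, int[] hyp, Set<String> dictionary,
                                  boolean onlyLongestMatch, boolean noSubMatches,
                                  boolean noOverlappingMatches) {
        List<String> subwords = new ArrayList<>();
        boolean restricted = onlyLongestMatch || noSubMatches || noOverlappingMatches;
        // Early exit: the token itself is a dictionary word, so emit no subwords.
        if (restricted && dictionary.contains(token)) {
            return subwords;
        }
        int lastEnd = 0; // furthest end offset of any subword accepted so far
        for (int i = 0; i < hyp.length - 1; i++) {
            int start = hyp[i];
            if (noOverlappingMatches && start < lastEnd) {
                continue; // would start inside an already accepted subword
            }
            // Inner loop runs from the longest candidate to the shortest.
            for (int j = hyp.length - 1; j > i; j--) {
                int end = hyp[j];
                if ((noSubMatches || noOverlappingMatches) && end <= lastEnd) {
                    break; // this and all shorter candidates are enclosed
                }
                String candidate = token.substring(start, end);
                if (dictionary.contains(candidate)) {
                    subwords.add(candidate);
                    lastEnd = Math.max(lastEnd, end);
                    if (restricted) {
                        break; // keep only the longest match for this start
                    }
                }
            }
        }
        return subwords;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("fuss", "fussball", "ballpumpe", "ball", "pumpe");
        int[] hyp = {0, 4, 8, 11, 13}; // assumed split: fuss|ball|pum|pe
        System.out.println(decompose("fussballpumpe", hyp, dict, false, false, false));
        System.out.println(decompose("fussballpumpe", hyp, dict, true, false, false));
        System.out.println(decompose("fussballpumpe", hyp, dict, false, true, false));
        System.out.println(decompose("fussballpumpe", hyp, dict, false, false, true));
    }
}
```

Because the inner loop starts with the longest candidate, the restricted modes can stop at the first dictionary hit for each start point, which is where the simplification mentioned above comes from. In this sketch the longer subword for a given start is emitted before the shorter ones, which illustrates the change in token order.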

h3. Tests

I added two test methods in {{TestCompoundWordTokenFilter}}:

1. {{#testNoSubAndNoOverlap()}} tests the expected behaviour of the {{noSubMatches}} and {{noOverlappingMatches}}
options
2. {{#testNoSubAndTokenInDictionary()}} tests that no tokens for subwords are added in the
case that the token is part of the dictionary

In addition, {{TestHyphenationCompoundWordTokenFilterFactory#testLucene8183()}} asserts that
the new configuration options are parsed.

h3. Environment

This patch is based on {{master}} from {{git@github.com:apache/lucene-solr.git}} commit: {{d512cd7604741a2f55deb0c840188f0342f1ba57}}




> HyphenationCompoundWordTokenFilter creates overlapping tokens with onlyLongestMatch enabled
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8183
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8183
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 6.6
>         Environment: Configuration of the analyzer:
> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.HyphenationCompoundWordTokenFilterFactory" 
>         hyphenator="lang/hyph_de_DR.xml" encoding="iso-8859-1"
>          dictionary="lang/wordlist_de.txt" 
>         onlyLongestMatch="true"/>
>  
>            Reporter: Rupert Westenthaler
>            Assignee: Uwe Schindler
>            Priority: Major
>         Attachments: LUCENE-8183_20180223_rwesten.diff, LUCENE-8183_20180227_rwesten.diff, lucene-8183.zip
>
>
> The HyphenationCompoundWordTokenFilter creates overlapping tokens even if onlyLongestMatch is enabled.
> Example:
> Dictionary: {{gesellschaft}}, {{schaft}}
>  Hyphenator: {{de_DR.xml}} //from Apache OFFO
>  onlyLongestMatch: true
>  
> |text|gesellschaft|gesellschaft|schaft|
> |raw_bytes|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[67 65 73 65 6c 6c 73 63 68 61 66 74]|[73 63 68 61 66 74]|
> |start|0|0|0|
> |end|12|12|12|
> |positionLength|1|1|1|
> |type|word|word|word|
> |position|1|1|1|
> IMHO this includes 2 unexpected tokens:
>  # the 2nd 'gesellschaft', as it duplicates the original token
>  # the 'schaft', as it is a sub-token of 'gesellschaft', which is present in the dictionary
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

