lucene-dev mailing list archives

From "Alan Woodward (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-8273) Add a ConditionalTokenFilter
Date Wed, 23 May 2018 08:15:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16486886#comment-16486886 ]

Alan Woodward commented on LUCENE-8273:
---------------------------------------

The Elastic CI has found some reproducing seeds in TestRandomChains; the failures look like the following:
{code}
Suite: org.apache.lucene.analysis.core.TestRandomChains
01:47:39    [junit4]   2> Exception from random analyzer: 
01:47:39    [junit4]   2> charfilters=
01:47:39    [junit4]   2>   org.apache.lucene.analysis.fa.PersianCharFilter(java.io.StringReader@36de1051)
01:47:39    [junit4]   2>   org.apache.lucene.analysis.charfilter.MappingCharFilter(org.apache.lucene.analysis.charfilter.NormalizeCharMap@31483c67, org.apache.lucene.analysis.fa.PersianCharFilter@51a9d324)
01:47:39    [junit4]   2> tokenizer=
01:47:39    [junit4]   2>   org.apache.lucene.analysis.core.UnicodeWhitespaceTokenizer(org.apache.lucene.util.AttributeFactory$1@27232fb3, 35)
01:47:39    [junit4]   2> filters=ConditionalTokenFilter: 
01:47:39    [junit4]   2>   org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter(OneTimeWrapper@5f621e45 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1, org.apache.lucene.analysis.compound.hyphenation.HyphenationTree@40cdd67e)ConditionalTokenFilter:
01:47:39    [junit4]   2>   org.apache.lucene.analysis.in.IndicNormalizationFilter(OneTimeWrapper@2de2e47c term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1)ConditionalTokenFilter:
01:47:39    [junit4]   2>   org.apache.lucene.analysis.MockRandomLookaheadTokenFilter(java.util.Random@4ced13ac, OneTimeWrapper@7d30a80d term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word,termFrequency=1)
01:47:39    [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestRandomChains -Dtests.method=testRandomChainsWithLargeStrings -Dtests.seed=72E157E8E16C0F79 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=en-US -Dtests.timezone=America/Anguilla -Dtests.asserts=true -Dtests.file.encoding=US-ASCII
01:47:39    [junit4] FAILURE 0.57s J0 | TestRandomChains.testRandomChainsWithLargeStrings <<<
01:47:39    [junit4]    > Throwable #1: java.lang.AssertionError
01:47:39    [junit4]    > 	at __randomizedtesting.SeedInfo.seed([72E157E8E16C0F79:18BAE8F9B8222F8A]:0)
01:47:39    [junit4]    > 	at org.apache.lucene.analysis.LookaheadTokenFilter.peekToken(LookaheadTokenFilter.java:140)
{code}

The root cause is that LookaheadTokenFilter trips an assertion when it is wrapped in a
ConditionalTokenFilter (CTF) and the stream contains stacked tokens:
- CTF works by presenting the underlying TokenStream to its wrapped filter as a series of
snippets, demarcated by tokens that don't pass the {{shouldFilter()}} test.  When a snippet
starts (i.e. when a token that passes {{shouldFilter()}} appears after one that doesn't),
{{reset()}} is called on the delegate; when it ends (i.e. when a token that doesn't pass
{{shouldFilter()}} appears), {{end()}} is called.
- This means that if we have stacked tokens, with the first not passing {{shouldFilter()}}
and the second passing it, the wrapped filter can see a TokenStream whose first token has a
position increment of 0.
- LookaheadTokenFilter has an explicit assertion that the initial posInc is not 0, which is
the AssertionError raised from {{peekToken()}} in the trace above.
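
To make the snippet boundaries concrete, here is a toy model in plain Java. It does not use any Lucene classes: {{Token}}, {{snippets()}} and the sample terms are all made up for illustration, and {{shouldFilter}} is just a predicate on the toy token. It only shows how a stacked token that passes the test can end up first in a snippet while still carrying a posInc of 0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class SnippetDemo {
    /** Toy token (not a Lucene class): a term plus its position increment. */
    static final class Token {
        final String term;
        final int posInc; // 0 means "stacked on the previous token"
        Token(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    /** Groups consecutive tokens that pass the test into snippets. */
    static List<List<Token>> snippets(List<Token> stream, Predicate<Token> shouldFilter) {
        List<List<Token>> out = new ArrayList<>();
        List<Token> current = null;
        for (Token t : stream) {
            if (shouldFilter.test(t)) {
                if (current == null) {       // snippet starts: the delegate would be reset() here
                    current = new ArrayList<>();
                    out.add(current);
                }
                current.add(t);
            } else {
                current = null;              // snippet ends: the delegate would get end() here
            }
        }
        return out;
    }

    /** Hypothetical stream: "wi-fi" is stacked (posInc 0) on top of "wifi". */
    static List<Token> stackedExample() {
        return Arrays.asList(new Token("wifi", 1), new Token("wi-fi", 0), new Token("router", 1));
    }

    /** posInc of the first token the delegate would see in the first snippet. */
    static int firstPosIncOfFirstSnippet(List<Token> stream, Predicate<Token> shouldFilter) {
        return snippets(stream, shouldFilter).get(0).get(0).posInc;
    }

    public static void main(String[] args) {
        int posInc = firstPosIncOfFirstSnippet(stackedExample(), t -> t.term.contains("-"));
        System.out.println("first snippet starts with posInc=" + posInc);
    }
}
```

With the hyphen predicate, only "wi-fi" passes, so it opens a snippet on its own and the delegate's view of the stream begins with a posInc of 0.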

I think this can be fixed by adjusting the posInc while we're delegating: the delegated
snippet is made to start with a posInc of 1, and the CTF then adjusts the posInc back
downwards before the token is emitted.
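
A rough sketch of that adjustment, again as a toy in plain Java rather than a patch. The class and method names are hypothetical, the snippet is reduced to an array of position increments, and the wrapped filter is modelled as a pass-through that only checks the initial-posInc invariant. A real delegate may buffer or insert tokens, so where the correction is applied on the way out would need more care than this.

```java
import java.util.Arrays;

public class PosIncAdjustDemo {
    /** Stand-in for the delegate's check: a stream must not start with posInc 0. */
    static void checkInitialPosInc(int[] posIncs) {
        if (posIncs.length > 0 && posIncs[0] == 0) {
            throw new AssertionError("initial posInc of 0");
        }
    }

    /** How much the first posInc has to be raised so the delegate sees at least 1. */
    static int adjustmentFor(int firstPosInc) {
        return firstPosInc == 0 ? 1 : 0;
    }

    /** Passes one snippet through the toy delegate and returns the emitted posIncs. */
    static int[] throughDelegate(int[] snippetPosIncs) {
        if (snippetPosIncs.length == 0) {
            return snippetPosIncs;
        }
        int[] seen = snippetPosIncs.clone();
        int adjust = adjustmentFor(seen[0]);
        seen[0] += adjust;          // the delegate sees a snippet starting with posInc 1
        checkInitialPosInc(seen);   // the wrapped filter would run here (identity in this toy)
        seen[0] -= adjust;          // adjusted back down before the token is emitted
        return seen;
    }

    public static void main(String[] args) {
        int[] emitted = throughDelegate(new int[] {0, 1, 0});
        System.out.println(Arrays.toString(emitted));
    }
}
```

The point of the round trip is that downstream consumers still see the original stacked positions, while the delegate never observes a stream that begins with a posInc of 0.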

> Add a ConditionalTokenFilter
> ----------------------------
>
>                 Key: LUCENE-8273
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8273
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Major
>             Fix For: 7.4
>
>         Attachments: LUCENE-8273-2.patch, LUCENE-8273-2.patch, LUCENE-8273-part2-rebased.patch,
> LUCENE-8273-part2-rebased.patch, LUCENE-8273-part2.patch, LUCENE-8273-part2.patch, LUCENE-8273.patch,
> LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch, LUCENE-8273.patch,
> LUCENE-8273.patch, LUCENE-8273.patch
>
>
> Spinoff of LUCENE-8265.  It would be useful to be able to wrap a TokenFilter in such
> a way that it could optionally be bypassed based on the current state of the TokenStream.
> This could be used to, for example, only apply WordDelimiterFilter to terms that contain
> hyphens.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

