lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-7622) Should BaseTokenStreamTestCase catch analyzers that create duplicate tokens?
Date Sat, 07 Jan 2017 10:49:58 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-7622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15807277#comment-15807277
] 

Uwe Schindler commented on LUCENE-7622:
---------------------------------------

For the boosting use cases above, it would be better to have an additional attribute in TokenStreams
that defaults to 1 but returns a "frequency" or "boost" when set. Then you could stop cloning
the tokens. FYI: I know that BM25 makes this type of boosting harder, but you can still add
emphasis to tokens in a text by duplicating them.
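
To make the idea concrete, here is a rough sketch in plain Java with no Lucene classes. The names TokenFrequencySketch, Token and collapse are hypothetical and are not an existing Lucene API. It only illustrates replacing cloned tokens with a single token carrying a frequency that defaults to 1. For simplicity it groups by term text alone and ignores positions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TokenFrequencySketch {

    /** Hypothetical token: term text plus a frequency that defaults to 1. */
    public static final class Token {
        public final String term;
        public final int frequency;

        public Token(String term) {
            this(term, 1); // default frequency when nothing was boosted
        }

        public Token(String term, int frequency) {
            this.term = term;
            this.frequency = frequency;
        }
    }

    /** Collapses cloned terms into one token each, keeping first-seen order. */
    public static List<Token> collapse(List<String> terms) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String term : terms) {
            counts.merge(term, 1, Integer::sum);
        }
        List<Token> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            out.add(new Token(e.getKey(), e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        // "lucene" was cloned three times to emphasize it.
        for (Token t : collapse(List.of("lucene", "lucene", "lucene", "search"))) {
            System.out.println(t.term + " x" + t.frequency);
        }
    }
}
```

A consumer would then read the frequency from the token instead of counting clones in the stream.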

> Should BaseTokenStreamTestCase catch analyzers that create duplicate tokens?
> ----------------------------------------------------------------------------
>
>                 Key: LUCENE-7622
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7622
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>         Attachments: LUCENE-7622.patch
>
>
> The change to BTSTC is quite simple, to catch any case where the same term text spans
> from the same position with the same position length. Such duplicate tokens are silly to add
> to the index, or to search at search time.
> Yet, this change produced many failures, and I looked briefly at them, and they are cases
> that I think are actually OK, e.g. {{PatternCaptureGroupTokenFilter}} capturing (..)(..) on
> the string {{ktkt}} will create a duplicate token.
> Other cases looked more dubious, e.g. {{WordDelimiterFilter}}.
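
The duplicate condition described in the issue (same term text, same position, same position length) can be sketched in plain Java as follows. This is not the attached patch and does not use BaseTokenStreamTestCase. The names DuplicateTokenSketch, Tok, captureGroups and hasDuplicate are hypothetical, and the sketch assumes every capture group is emitted at the same position with position length 1.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DuplicateTokenSketch {

    /** Hypothetical token record: term text, position, position length. */
    public static final class Tok {
        public final String term;
        public final int position;
        public final int positionLength;

        public Tok(String term, int position, int positionLength) {
            this.term = term;
            this.position = position;
            this.positionLength = positionLength;
        }
    }

    /** Emits one token per capture group, all at position 0 (an assumption). */
    public static List<Tok> captureGroups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<Tok> out = new ArrayList<>();
        if (m.matches()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                if (m.group(g) != null) {
                    out.add(new Tok(m.group(g), 0, 1));
                }
            }
        }
        return out;
    }

    /** True if two tokens share term text, position and position length. */
    public static boolean hasDuplicate(List<Tok> tokens) {
        Set<String> seen = new HashSet<>();
        for (Tok t : tokens) {
            // Numbers come first, so a ':' inside the term cannot cause a clash.
            String key = t.position + ":" + t.positionLength + ":" + t.term;
            if (!seen.add(key)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Both groups of (..)(..) capture "kt" from "ktkt".
        System.out.println(hasDuplicate(captureGroups("(..)(..)", "ktkt")));
        System.out.println(hasDuplicate(captureGroups("(..)(..)", "abcd")));
    }
}
```

On the input ktkt both groups capture the same two characters, so the check reports a duplicate even though the pattern is behaving as written.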



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

