lucene-dev mailing list archives

From "Steve Rowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-7857) CharTokenizer-derived tokenizers and KeywordTokenizer emit multiple tokens when the max length is exceeded
Date Wed, 31 May 2017 01:08:05 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-7857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16030459#comment-16030459 ]

Steve Rowe commented on LUCENE-7857:
------------------------------------

I agree with Robert.

See my answer to a question about why StandardTokenizer effectively splits tokens that are
longer than maxTokenLength in this recent java-user mailing list thread: [https://lists.apache.org/thread.html/42af955be9522cff0d28b47d7fa723d90846ad011157503fcf687f99@%3Cjava-user.lucene.apache.org%3E].

The workaround I outlined on that thread would work here too: set maxTokenLength super-high,
then use LengthFilter to remove tokens longer than what you want to keep.
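
To make the workaround concrete, here is a minimal plain-Java sketch of the idea. It does not
use the Lucene API: the class LengthFilterSketch and its methods tokenize and dropLongerThan
are hypothetical stand-ins for a whitespace tokenizer with a max token length and for
LengthFilter, written only to show the effect of the two settings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation, not Lucene code.
class LengthFilterSketch {

    // Models a tokenizer that splits on whitespace and chops any run
    // longer than maxTokenLength into consecutive chunks.
    static List<String> tokenize(String text, int maxTokenLength) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            for (int start = 0; start < word.length(); start += maxTokenLength) {
                int end = Math.min(word.length(), start + maxTokenLength);
                tokens.add(word.substring(start, end));
            }
        }
        return tokens;
    }

    // Models the filtering step: drop every token longer than maxKept.
    static List<String> dropLongerThan(List<String> tokens, int maxKept) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() <= maxKept) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Low max length: the long word is split into fragments.
        System.out.println(tokenize("a letter", 3));
        // Workaround: very high max length, then filter by length.
        System.out.println(dropLongerThan(tokenize("a letter", 1_000_000), 3));
    }
}
```

With the low limit the fragments "let" and "ter" are emitted; with the high limit plus the
length filter the over-long word is dropped entirely and no fragments are indexed.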

> CharTokenizer-derived tokenizers and KeywordTokenizer emit multiple tokens when the max length is exceeded
> ----------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7857
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7857
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>
> Assigning to myself to not lose track of it.
> LUCENE-7705 introduced the ability to define the allowable token length for these tokenizers
> rather than hard-coding it to 255. It's always been the case that when the hard-coded limit
> was exceeded, multiple tokens would be emitted. However, the tests for LUCENE-7705 exposed a
> problem.
> Suppose the max length is 3 and the doc contains "letter". Two tokens are emitted and
> indexed: "let" and "ter".
> Now suppose the search is for "lett". If the default operator is AND, or phrase queries
> are constructed, the query fails, since the tokens emitted are "let" and "t". Only if the
> operator is OR is the document found, and even then the result won't be correct, since
> searching for "lett" would also match a document indexed with "bett": it would match on the
> bare "t".
> Proposal:
> The remainder of the token should be ignored when maxTokenLen is exceeded.
> [~rcmuir][~steve_rowe][~tomasflobbe] comments? Again, this behavior was not introduced
> by LUCENE-7705, it's just that it would be very hard to notice with the default 255 char
> limit.
> I'm not quite sure why master generates a parsed query of:
> field:let field:t
> and 6x generates
> field:"let t"
> so the tests succeeded on master but not on 6x....
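
The "letter" / "lett" / "bett" example from the quoted description can be traced with a small
plain-Java sketch. It does not use the Lucene API: the class MaxTokenLenSketch and its methods
(split, truncate, matchesAll, matchesAny) are hypothetical, and AND/OR matching is modeled
simply as "all query tokens present" versus "any query token present".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation, not Lucene code. Terms are assumed non-empty.
class MaxTokenLenSketch {

    // Behavior described in the issue: an over-long term is emitted as
    // several tokens of at most maxLen characters each.
    static List<String> split(String term, int maxLen) {
        List<String> tokens = new ArrayList<>();
        for (int start = 0; start < term.length(); start += maxLen) {
            tokens.add(term.substring(start, Math.min(term.length(), start + maxLen)));
        }
        return tokens;
    }

    // Proposed behavior: keep the first maxLen characters, ignore the rest.
    static List<String> truncate(String term, int maxLen) {
        return List.of(term.substring(0, Math.min(term.length(), maxLen)));
    }

    // AND semantics: every query token must be among the indexed tokens.
    static boolean matchesAll(List<String> indexed, List<String> query) {
        return indexed.containsAll(query);
    }

    // OR semantics: at least one query token must be among the indexed tokens.
    static boolean matchesAny(List<String> indexed, List<String> query) {
        for (String token : query) {
            if (indexed.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> query = split("lett", 3);                           // [let, t]
        System.out.println(matchesAll(split("letter", 3), query));       // false: "t" was never indexed
        System.out.println(matchesAny(split("letter", 3), query));       // true: matches on "let"
        System.out.println(matchesAny(split("bett", 3), query));         // true: matches on the bare "t"

        List<String> truncatedQuery = truncate("lett", 3);                // [let]
        System.out.println(matchesAll(truncate("letter", 3), truncatedQuery)); // true
        System.out.println(matchesAny(truncate("bett", 3), truncatedQuery));   // false
    }
}
```

Under the splitting behavior the AND query misses "letter" and the OR query wrongly matches
"bett"; under the proposed truncation both cases come out as the description expects.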



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


