lucene-dev mailing list archives

From "Shawn Heisey (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms
Date Wed, 02 May 2018 22:40:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16461695#comment-16461695 ]

Shawn Heisey commented on LUCENE-7960:
--------------------------------------

My original idea would have been handled by one boolean -- keeping terms shorter than minGram.
 On more than one occasion, I've fielded questions where it turns out the user is trying to
search for terms shorter than their minGram size.

While discussing it, the notion of *long* terms being removed by the min/max range also came
up. It was an idea I had not originally considered, but I have since encountered a user who
had ngram on the index side but not the query side, and who wanted to search for terms
longer than their maxGram size.

It could be reduced to one "keep" boolean to keep both short and long terms, but I think we're
going to have people who want to keep short terms but not long terms, and vice versa.
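
To make the two-option idea concrete, here is a minimal sketch in plain Python. It is not
Lucene code and does not reflect any committed API: the function and the flag names
keep_short and keep_long are hypothetical, chosen only to illustrate how two independent
booleans would behave for terms outside the min/max range.

```python
def ngrams(term, min_gram, max_gram, keep_short=False, keep_long=False):
    """Return the n-grams of `term` whose lengths fall in [min_gram, max_gram].

    keep_short and keep_long are hypothetical flags for illustration only.
    """
    grams = [
        term[start:start + size]
        for start in range(len(term))
        for size in range(min_gram, max_gram + 1)
        if start + size <= len(term)
    ]
    # A term shorter than min_gram yields no grams; optionally keep it whole.
    if keep_short and len(term) < min_gram:
        grams.append(term)
    # A term longer than max_gram never appears whole; optionally keep it.
    if keep_long and len(term) > max_gram:
        grams.append(term)
    return grams


print(ngrams("ab", 3, 4))                      # short term, dropped entirely
print(ngrams("ab", 3, 4, keep_short=True))     # short term preserved
print(ngrams("abcde", 2, 3, keep_long=True))   # grams plus the whole long term
```

With separate flags, a user can preserve short terms without also emitting every long term
whole, which a single "keep" boolean would not allow.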


> NGram filters -- add option to keep short terms
> -----------------------------------------------
>
>                 Key: LUCENE-7960
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7960
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Shawn Heisey
>            Priority: Major
>         Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the minGramSize
> are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of problems for
> users.  I am not suggesting that the default behavior be changed.  That would be far too
> disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like keepShortTerms, that
> defaults to false, to allow the short terms to be preserved.
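
The behavior described in the issue can be sketched for the edge n-gram case as follows.
This is a plain-Python illustration under stated assumptions, not the Lucene filter itself:
whitespace splitting stands in for a real tokenizer, and keep_short_terms mirrors the
option name suggested above, defaulting to False so existing behavior is unchanged.

```python
def edge_ngrams(term, min_gram, max_gram, keep_short_terms=False):
    """Return the leading-edge n-grams of `term` (illustrative sketch only)."""
    grams = [
        term[:size]
        for size in range(min_gram, max_gram + 1)
        if size <= len(term)
    ]
    # By default a term shorter than min_gram produces nothing at all.
    if keep_short_terms and len(term) < min_gram:
        grams.append(term)
    return grams


def analyze(text, min_gram, max_gram, keep_short_terms=False):
    """Split on whitespace (a stand-in tokenizer) and edge-n-gram each term."""
    tokens = []
    for term in text.split():
        tokens.extend(edge_ngrams(term, min_gram, max_gram, keep_short_terms))
    return tokens


print(analyze("go to lucene", 3, 4))
print(analyze("go to lucene", 3, 4, keep_short_terms=True))
```

In the first call the terms "go" and "to" vanish from the token stream, so a search for
either of them cannot match; in the second call they survive alongside the grams.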



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

