lucene-dev mailing list archives

From "Shawn Heisey (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-7960) NGram filters -- add option to keep short terms
Date Tue, 01 May 2018 17:17:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16459873#comment-16459873 ]

Shawn Heisey commented on LUCENE-7960:
--------------------------------------

I've had a look at the PR.

Changing the signature on an existing constructor isn't a good idea.  Lucene is a public API
and there will be user code using that constructor that must continue to work if Lucene is
upgraded.  We should add a new constructor and have the existing constructor(s) call that
one with default values.
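
To illustrate the suggestion above, here is a minimal sketch of the overload-and-delegate pattern. The class name GramFilterSketch and its accessor are hypothetical stand-ins, not the actual Lucene filter classes, and the keepShortTerms parameter name is taken from the issue description rather than from the PR.

```java
// Hypothetical stand-in for an n-gram filter; this is NOT the real Lucene class.
final class GramFilterSketch {
    private final int minGram;
    private final int maxGram;
    private final boolean keepShortTerms;

    /** Existing signature, left untouched so current user code keeps compiling. */
    GramFilterSketch(int minGram, int maxGram) {
        // Delegate with the default value, which preserves the old behavior.
        this(minGram, maxGram, false);
    }

    /** New overload that carries the additional option. */
    GramFilterSketch(int minGram, int maxGram, boolean keepShortTerms) {
        if (minGram < 1 || minGram > maxGram) {
            throw new IllegalArgumentException("require 1 <= minGram <= maxGram");
        }
        this.minGram = minGram;
        this.maxGram = maxGram;
        this.keepShortTerms = keepShortTerms;
    }

    /** Accessor added only so the sketch can be inspected. */
    boolean keepsShortTerms() {
        return keepShortTerms;
    }
}
```

Callers of the two-argument constructor see no change, and only code that opts in to the three-argument form gets the new behavior.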

The only question about that is whether the existing constructor should be deprecated in stable
and removed in master.  I'm not sure who to ask.

There are some variable renames.  They don't look like problems, especially because the visibility
is private, but I'd like to get the opinion of someone who has deeper Lucene knowledge.

I'm having a difficult time following the modifications to the filter logic.  Some of the
modifications look like they're not directly related to implementing this issue, but I can't
tell for sure.


> NGram filters -- add option to keep short terms
> -----------------------------------------------
>
>                 Key: LUCENE-7960
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7960
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Shawn Heisey
>            Priority: Major
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the minGramSize
> are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of problems for
> users.  I am not suggesting that the default behavior be changed.  That would be far too disruptive
> to the existing user base.
> I do think there should be a new boolean option, with a name like keepShortTerms, that
> defaults to false, to allow the short terms to be preserved.
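
The behavior described in the issue can be sketched in plain Java as follows. This is a simplified illustration only: it works on a single String using char offsets and covers just the edge n-gram case, whereas the real filters operate on a token stream. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the proposed option; not Lucene code.
final class EdgeGramSketch {
    /** Returns the leading n-grams of term with lengths minGram..maxGram. */
    static List<String> edgeGrams(String term, int minGram, int maxGram,
                                  boolean keepShortTerms) {
        List<String> out = new ArrayList<>();
        if (term.length() < minGram) {
            // Current behavior drops the term entirely; the proposed flag
            // passes it through unchanged instead.
            if (keepShortTerms) {
                out.add(term);
            }
            return out;
        }
        int longest = Math.min(maxGram, term.length());
        for (int len = minGram; len <= longest; len++) {
            out.add(term.substring(0, len));
        }
        return out;
    }
}
```

With minGram set to 2 and maxGram set to 3, the term "a" produces nothing when the flag is false and is emitted unchanged when the flag is true, while "lucene" yields "lu" and "luc" in both cases.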



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
