lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range
Date Wed, 09 May 2018 03:55:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16468292#comment-16468292 ]

Robert Muir commented on LUCENE-7960:
-------------------------------------

{quote}
I made the min/max parameters required on the factory because the constructor without any
size parameters is deprecated. Is this something you don't like at all, or something you would
only want to see in master?
{quote}

What does "not making that change in the backport to 7x" mean?
As I suggested above: consider making the patch against master fully backwards compatible.
We can review it, then it can be committed and merged cleanly and safely back to 7.x. After that,
remove the deprecations in master in a separate, dedicated commit.

It seems like more work, but I think it's less work than trying to take a shortcut, because you
can be confident you don't break anything. "Making changes during backports" seems like trouble,
and a confusing patch makes the code review hard. The current one is confusing because
it isn't really appropriate for either master (it has deprecations) or 7x (it breaks backwards
compatibility).
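
To illustrate what "fully backwards compatible" could look like for the factory arguments, here is a minimal sketch. It is not the patch and not Lucene's factory API: the class GramArgsSketch, the argument names, and the default values are all assumptions chosen for illustration. The idea is that the size arguments stay optional and fall back to defaults, the new option defaults to the old behavior, and the old code path is only marked deprecated rather than removed.

```java
import java.util.Map;

// Hypothetical stand-in for factory argument handling; not Lucene code.
final class GramArgsSketch {
    // Placeholder defaults, assumed for this sketch only.
    static final int DEFAULT_MIN_GRAM = 1;
    static final int DEFAULT_MAX_GRAM = 2;

    final int minGram;
    final int maxGram;
    final boolean preserveOriginal;

    /** Old entry point: kept working and only marked deprecated. */
    @Deprecated
    GramArgsSketch() {
        this(DEFAULT_MIN_GRAM, DEFAULT_MAX_GRAM, false);
    }

    GramArgsSketch(int minGram, int maxGram, boolean preserveOriginal) {
        if (minGram < 1 || minGram > maxGram) {
            throw new IllegalArgumentException("require 1 <= minGram <= maxGram");
        }
        this.minGram = minGram;
        this.maxGram = maxGram;
        this.preserveOriginal = preserveOriginal;
    }

    /** Size arguments remain optional, so existing configurations keep parsing. */
    static GramArgsSketch fromArgs(Map<String, String> args) {
        int min = Integer.parseInt(
            args.getOrDefault("minGramSize", String.valueOf(DEFAULT_MIN_GRAM)));
        int max = Integer.parseInt(
            args.getOrDefault("maxGramSize", String.valueOf(DEFAULT_MAX_GRAM)));
        // The new flag defaults to false, which is the pre-existing behavior.
        boolean keep = Boolean.parseBoolean(
            args.getOrDefault("preserveOriginal", "false"));
        return new GramArgsSketch(min, max, keep);
    }
}
```

Under this shape, making the arguments required and deleting the deprecated constructor would be the separate follow-up commit on master.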


> NGram filters -- preserve the original token when it is outside the min/max size range
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7960
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7960
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Shawn Heisey
>            Priority: Major
>         Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of problems for users.  I am not suggesting that the default behavior be changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like keepShortTerms, that defaults to false, to allow the short terms to be preserved.
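
The behavior described in the issue can be sketched in plain Java without any Lucene classes. This is only an illustration under stated assumptions: NGramSketch and its ngrams method are hypothetical names, lengths are counted in Java chars rather than code points, and the order in which grams are emitted is not meant to match any real filter. The keepShortTerms parameter follows the name proposed in the description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the proposed option; not a Lucene TokenFilter.
final class NGramSketch {
    static List<String> ngrams(String term, int minGram, int maxGram,
                               boolean keepShortTerms) {
        List<String> out = new ArrayList<>();
        int len = term.length();
        if (len < minGram) {
            // Current behavior drops the term entirely; the proposed
            // option would emit it unchanged instead.
            if (keepShortTerms) {
                out.add(term);
            }
            return out;
        }
        // Emit every substring whose length lies in [minGram, maxGram].
        for (int start = 0; start < len; start++) {
            for (int size = minGram; size <= maxGram && start + size <= len; size++) {
                out.add(term.substring(start, start + size));
            }
        }
        return out;
    }
}
```

With minGram=3 and maxGram=4, the term "ab" yields no tokens when keepShortTerms is false and yields "ab" itself when it is true, while a term that is long enough produces the same grams either way.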



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

