lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Shawn Heisey (JIRA)" <>
Subject [jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range
Date Wed, 09 May 2018 02:14:00 GMT


Shawn Heisey commented on LUCENE-7960:

All good points.

I made the min/max parameters required on the factory because the constructor without any
size parameters is deprecated.  Is this something you don't like at all, or something you
would only want to see in master?  You're right that there is some backcompat confusion in
that.  I wasn't completely sure it was a good idea, decided to proceed, with the probability
of not making that change in the backport to 7x.

If we do need to keep default values, the current defaults (1 and 2 for ngram, 1 and 1 for
edgengram) really kinda suck.  But changing them is another backcompat break.  Seems better
to completely get rid of defaults in master, and unless I misunderstood you, that seems to
be your position as well.

I would need to review the patch to be sure, but I think that the cosmetic style changes you
mentioned were made by the contributor before I started working on it.  The changes to the
inner workings of the filter looked like they were doing more than the issue requires, but
I don't understand token handling code well enough to grasp what the changes were actually
doing.  Tests pass, so if there's a problem there, it's not being caught by test coverage.
 Please feel free to adjust or make suggestions!

I'm done in for the night, and will poke at it again tomorrow.  If you could summarize everything
that you'd like to see in a new patch, that would help me.

> NGram filters -- preserve the original token when it is outside the min/max size range
> --------------------------------------------------------------------------------------
>                 Key: LUCENE-7960
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Shawn Heisey
>            Priority: Major
>         Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
> When ngram or edgengram filters are used, any terms that are shorter than the minGramSize
are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of problems for
users.  I am not suggesting that the default behavior be changed.  That would be far too disruptive
to the existing user base.
> I do think there should be a new boolean option, with a name like keepShortTerms, that
defaults to false, to allow the short terms to be preserved.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message