lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range
Date Wed, 09 May 2018 00:30:00 GMT


Robert Muir commented on LUCENE-7960:

The patch is a little confused about back compat (e.g. it breaks back compat in the factories
by requiring parameters that were optional before, but keeps back compat in the tokenfilters),
so I'm not sure whether it's geared at the master branch or not. Sometimes it's easiest to make
the patch with all the back compat, commit it to master and merge it back, then make a separate
commit to master only to remove the cruft. Maybe that's good in this case.
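
To illustrate the factory side of the back-compat point, here is a minimal sketch of a factory that keeps previously optional parameters optional and adds the new flag with a default of false. The class EdgeGramFactorySketch, the parameter names minGramSize, maxGramSize and preserveOriginal, and the default values are assumptions made for illustration; this is not the code from the patch or from Lucene.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a filter factory; not Lucene's actual class. */
final class EdgeGramFactorySketch {
  final int minGram;
  final int maxGram;
  final boolean preserveOriginal;

  EdgeGramFactorySketch(Map<String, String> args) {
    // Work on a copy so the caller's map is left untouched.
    Map<String, String> remaining = new HashMap<>(args);
    // Parameters that were optional stay optional (default values here are illustrative).
    this.minGram = getInt(remaining, "minGramSize", 1);
    this.maxGram = getInt(remaining, "maxGramSize", 1);
    // The new flag defaults to false, so existing configurations behave as before.
    this.preserveOriginal = getBoolean(remaining, "preserveOriginal", false);
    if (!remaining.isEmpty()) {
      throw new IllegalArgumentException("Unknown parameters: " + remaining);
    }
  }

  private static int getInt(Map<String, String> args, String name, int defaultValue) {
    String value = args.remove(name);
    return value == null ? defaultValue : Integer.parseInt(value);
  }

  private static boolean getBoolean(Map<String, String> args, String name, boolean defaultValue) {
    String value = args.remove(name);
    return value == null ? defaultValue : Boolean.parseBoolean(value);
  }
}
```

With this shape, a configuration written before the change (no preserveOriginal entry) still constructs successfully, and only configurations that opt in see the new behavior.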

There are some cosmetic style changes, such as moving attribute initialization into the ctor
instead of inline, which differ from the style of all our other tokenfilters. They make
it hard to review the logic changes (I have not looked at those, just the APIs and docs).

As far as docs go, I think there are easy wins. Let's take EdgeNGramTokenFilter as an example.

For the ctor with all the parameters, it doesn't need to document what the other
ctors do: they can have their own docs. It should only document its own behavior and
parameters, as it already does, so we can just remove its last line about that.

For the other ctors which are shortcuts/sugar, we can add a line such as this:
   * <p>
   * Behaves the same as {@link #EdgeNGramTokenFilter(TokenStream, int, int, boolean) 
   *                             EdgeNGramTokenFilter(input, minGram, maxGram, false)}

This helps make it clear what the shortcut/sugar is really doing with a clickable link, and
it also helps the deprecated case, if someone has to transition their code.
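
As a rough sketch of that documentation pattern, the following self-contained class shows a full constructor that documents only its own behavior, plus a shortcut constructor whose Javadoc links to it. EdgeGramSketch is a hypothetical, string-based stand-in written for illustration (it works on a plain String rather than a TokenStream), and the parameter name preserveOriginal is an assumption; it is not the patch's or Lucene's implementation.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, string-based stand-in for an edge n-gram filter; not Lucene's implementation. */
final class EdgeGramSketch {
  private final int minGram;
  private final int maxGram;
  private final boolean preserveOriginal;

  /**
   * Emits the prefixes of a term whose lengths lie between minGram and maxGram.
   *
   * @param minGram the smallest prefix length to emit
   * @param maxGram the largest prefix length to emit
   * @param preserveOriginal whether to also emit a term that is shorter than minGram
   *                         or longer than maxGram
   */
  EdgeGramSketch(int minGram, int maxGram, boolean preserveOriginal) {
    if (minGram < 1 || minGram > maxGram) {
      throw new IllegalArgumentException("require 1 <= minGram <= maxGram");
    }
    this.minGram = minGram;
    this.maxGram = maxGram;
    this.preserveOriginal = preserveOriginal;
  }

  /**
   * Emits the prefixes of a term whose lengths lie between minGram and maxGram.
   * <p>
   * Behaves the same as {@link #EdgeGramSketch(int, int, boolean)
   *                             EdgeGramSketch(minGram, maxGram, false)}
   */
  EdgeGramSketch(int minGram, int maxGram) {
    this(minGram, maxGram, false);
  }

  /** Returns the grams for one term, counting length in Java chars for simplicity. */
  List<String> apply(String term) {
    List<String> out = new ArrayList<>();
    int limit = Math.min(maxGram, term.length());
    for (int len = minGram; len <= limit; len++) {
      out.add(term.substring(0, len));
    }
    // Only terms outside the [minGram, maxGram] range are added back in full.
    if (preserveOriginal && (term.length() < minGram || term.length() > maxGram)) {
      out.add(term);
    }
    return out;
  }
}
```

In this sketch, the shortcut constructor drops a term shorter than minGram entirely, while passing true as the third argument keeps it. The position at which the original term is emitted relative to its grams is a choice made here for brevity.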

> NGram filters -- preserve the original token when it is outside the min/max size range
> --------------------------------------------------------------------------------------
>                 Key: LUCENE-7960
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Shawn Heisey
>            Priority: Major
>         Attachments: LUCENE-7960.patch, LUCENE-7960.patch, LUCENE-7960.patch
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
> When ngram or edgengram filters are used, any terms that are shorter than the minGramSize
> are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of problems for
> users.  I am not suggesting that the default behavior be changed.  That would be far too
> disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like keepShortTerms, that
> defaults to false, to allow the short terms to be preserved.
