Date: Tue, 8 May 2018 20:01:00 +0000 (UTC)
From: "Robert Muir (JIRA)"
To: dev@lucene.apache.org
Subject: [jira] [Commented] (LUCENE-7960) NGram filters -- preserve the original token when it is outside the min/max size range

    [ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16467894#comment-16467894 ]

Robert Muir commented on LUCENE-7960:
-------------------------------------

Yes, I think we should deprecate. It helps ppl upgrade and shouldn't be too bad in this case.

If we currently have 1-arg (TokenStream) and 3-arg (TokenStream, int, int), and we want to end up at 2-arg (TokenStream, int) and 4-arg (TokenStream, int, int, boolean) then 7.x can temporarily have 4 constructors: the existing two of which are deprecated and forward to the new ones. Their javadoc can even explain what the forwarding is doing.

master would just have the two new ones with no cruft.
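
To make the forwarding scheme concrete, here is a minimal sketch of what the four constructors could look like on the 7.x branch. It is not the attached patch: the class name SketchNGramFilter, the stand-in TokenStream interface, the field names, the default sizes, and the reading of the 2-arg int as a single fixed gram size are all assumptions made for illustration. The only points taken from the discussion are the four signatures, the forwarding, and that the new boolean defaults to false.

```java
// Stand-in for Lucene's TokenStream so the sketch compiles on its own (assumption).
interface TokenStream {}

// Hypothetical filter class; only the constructor shapes come from the comment above.
final class SketchNGramFilter {
  // Placeholder defaults for illustration only, not taken from the patch.
  static final int DEFAULT_MIN_GRAM = 1;
  static final int DEFAULT_MAX_GRAM = 2;

  final TokenStream input;
  final int minGram;
  final int maxGram;
  final boolean preserveOriginal;

  /** New 4-arg form: every other constructor ends up here. */
  SketchNGramFilter(TokenStream input, int minGram, int maxGram, boolean preserveOriginal) {
    if (minGram < 1) {
      throw new IllegalArgumentException("minGram must be greater than zero");
    }
    if (minGram > maxGram) {
      throw new IllegalArgumentException("minGram must not be greater than maxGram");
    }
    this.input = input;
    this.minGram = minGram;
    this.maxGram = maxGram;
    this.preserveOriginal = preserveOriginal;
  }

  /** New 2-arg form; treating the int as a fixed gram size is an assumption. */
  SketchNGramFilter(TokenStream input, int gramSize) {
    this(input, gramSize, gramSize, false);
  }

  /**
   * @deprecated Forwards to the 4-arg constructor with preserveOriginal=false,
   *     which keeps the existing behavior.
   */
  @Deprecated
  SketchNGramFilter(TokenStream input, int minGram, int maxGram) {
    this(input, minGram, maxGram, false);
  }

  /**
   * @deprecated Forwards to the 4-arg constructor with the placeholder default
   *     sizes and preserveOriginal=false.
   */
  @Deprecated
  SketchNGramFilter(TokenStream input) {
    this(input, DEFAULT_MIN_GRAM, DEFAULT_MAX_GRAM, false);
  }
}
```

On master the two @Deprecated constructors would simply be deleted, leaving the 2-arg and 4-arg forms.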

> NGram filters -- preserve the original token when it is outside the min/max size range
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-7960
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7960
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Shawn Heisey
>            Priority: Major
>         Attachments: LUCENE-7960.patch, LUCENE-7960.patch
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of problems for users. I am not suggesting that the default behavior be changed. That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like keepShortTerms, that defaults to false, to allow the short terms to be preserved.
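
The behavior described in the issue can be illustrated with a small standalone sketch. It does not use Lucene's filter classes: NGramSketch and ngrams are hypothetical names, the keepShortTerms parameter borrows the name suggested in the issue, and the output order (by start offset, then by length) is an assumption. It works on Java chars rather than code points to stay short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Lucene code: n-grams of one term, with an optional
// flag that keeps terms shorter than minGram instead of dropping them.
final class NGramSketch {
  static List<String> ngrams(String term, int minGram, int maxGram, boolean keepShortTerms) {
    List<String> out = new ArrayList<>();
    if (term.length() < minGram) {
      // With the flag off (the default in the proposal) a short term yields nothing.
      if (keepShortTerms) {
        out.add(term);
      }
      return out;
    }
    // Emit every substring whose length lies in [minGram, maxGram].
    for (int start = 0; start + minGram <= term.length(); start++) {
      for (int len = minGram; len <= maxGram && start + len <= term.length(); len++) {
        out.add(term.substring(start, start + len));
      }
    }
    return out;
  }
}
```

With minGram=3 and maxGram=4, the term "ab" produces an empty list when keepShortTerms is false and the single token "ab" when it is true.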