To: java-user@lucene.apache.org
From: Parit Bansal
Subject: WordDelimiterIterator word splitting usecase
Date: Fri, 22 Dec 2017 12:51:10 +0100

Hi,

I have been migrating and maintaining Lucene indexing code for our use case since version 2.x (we are now on 6.6.1 and migrating to 7.x). One problem I keep running into concerns the org.apache.lucene.analysis.miscellaneous.WordDelimiterIterator class, which is declared final in the Lucene codebase.

This class has an isBreak() method that decides when to split a word into subwords. One of its cases is *ALPHA->NUMERIC, NUMERIC->ALPHA :Don't split*, with both directions handled in the same if condition.
Unfortunately, in our use case we strictly want only *NUMERIC->ALPHA :Don't split* (that is, without the ALPHA->NUMERIC half of the rule), and there is no way to get this behaviour through the configurationFlags. Since isBreak() is private and WordDelimiterIterator is final, there is no possibility of subclassing and overriding the method. WordDelimiterIterator is also tightly coupled with WordDelimiterFilter (WordDelimiterGraphFilter in 7.x), and both are final. That leaves me with only one option: copy their code into custom classes and change the behaviour there. Clearly this is not a maintainable solution.

So I am looking for advice: what else is possible? Or is there a possibility of a patch/refactoring so that isBreak() honours some new configuration flags?

Best,
Parit Bansal
(Developer, www.uniprot.org)
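
To make the request above more concrete, here is a minimal sketch of what a direction-aware break check could look like. It is not Lucene code and does not reproduce the real isBreak(): the class name DirectionalSplitSketch and the two flags SPLIT_ALPHA_TO_NUMERIC and SPLIT_NUMERIC_TO_ALPHA are hypothetical, and characters are classified with Character.isLetter/isDigit rather than Lucene's own type table. Only the letter/digit transitions are modelled; case changes and delimiters are ignored.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch, NOT Lucene code: splits a word only on
 * letter/digit transitions, with one flag per direction.
 */
final class DirectionalSplitSketch {

    // Hypothetical flags: one per direction of the letter/digit transition.
    static final int SPLIT_ALPHA_TO_NUMERIC = 1;
    static final int SPLIT_NUMERIC_TO_ALPHA = 1 << 1;

    private final int flags;

    DirectionalSplitSketch(int flags) {
        this.flags = flags;
    }

    /** Returns true if a subword boundary falls between prev and next. */
    boolean isBreak(char prev, char next) {
        boolean alphaToNumeric = Character.isLetter(prev) && Character.isDigit(next);
        boolean numericToAlpha = Character.isDigit(prev) && Character.isLetter(next);
        if (alphaToNumeric) {
            return (flags & SPLIT_ALPHA_TO_NUMERIC) != 0;
        }
        if (numericToAlpha) {
            return (flags & SPLIT_NUMERIC_TO_ALPHA) != 0;
        }
        // Every other transition is out of scope for this sketch: never split.
        return false;
    }

    /** Splits word into subwords at every position where isBreak() is true. */
    List<String> split(String word) {
        List<String> parts = new ArrayList<>();
        if (word.isEmpty()) {
            return parts;
        }
        int start = 0;
        for (int i = 1; i < word.length(); i++) {
            if (isBreak(word.charAt(i - 1), word.charAt(i))) {
                parts.add(word.substring(start, i));
                start = i;
            }
        }
        parts.add(word.substring(start));
        return parts;
    }

    public static void main(String[] args) {
        // Split on ALPHA->NUMERIC only; NUMERIC->ALPHA stays unsplit.
        DirectionalSplitSketch sketch = new DirectionalSplitSketch(SPLIT_ALPHA_TO_NUMERIC);
        System.out.println(sketch.split("abc123def")); // prints [abc, 123def]
    }
}
```

With only SPLIT_ALPHA_TO_NUMERIC set, the NUMERIC->ALPHA transition is left alone while ALPHA->NUMERIC still splits, which is the combination the use case above needs. Setting both flags or neither treats the two directions alike.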