Message-ID: <270788205.1127444188006.JavaMail.jira@ajax.apache.org>
Date: Fri, 23 Sep 2005 04:56:28 +0200 (CEST)
From: "Yonik Seeley (JIRA)"
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-438) add Token.setTermText(), remove final
In-Reply-To: <1808228650.1127421987745.JavaMail.jira@ajax.apache.org>

[ http://issues.apache.org/jira/browse/LUCENE-438?page=comments#action_12330250 ]

Yonik Seeley commented on LUCENE-438:
-------------------------------------

Mostly to convey information across TokenFilters, and the single type string isn't sufficient.
For example, I'd like to have an int or long that can be used as 32 or 64 independent flags. In general, being able to attach attributes to tokens dynamically allows you to decompose token filters into more basic functions, which gives filter chains greater power.

Some use cases I can think of:
- conditionals: mark tokens in one filter and conditionally act on them in another.
- decoupling the marking of tokens from the transformation of tokens: one could have a TokenMatcherFilter that tags tokens matching a regex, for example.
- protected tokens: mark certain words as "do not change", "do not stem", or "do not lowercase", for instance.
- split tokens: mark tokens that are split from a larger token (for example, when a camelCase filter splits "fooBar" into "foo Bar") so that other filters may treat them differently.
- performance (hey, it comes for free): you can do things like StandardTokenFilter, which checks the type of the token and doesn't have to re-parse every single token.

I've already had to implement TokenFilter functionality (protected tokens, token splitting and combining) where I had to stuff more functionality than I'd like into a single filter because of the inability of one filter to provide more info to another.

So I think there's a strong case for being able to dynamically add attributes (set bit flags) on a token. I planned on subclassing Token to achieve that. But because I don't know what other people may need or want in the future, making it possible to extend Token via inheritance seems like a good thing.
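To make the flag idea concrete, here is a minimal standalone sketch of the "protected tokens" case: one step marks tokens that match a regex, and a separate step lowercases everything that is not marked. It does not use Lucene; FlaggedToken, FlagFilters, and the flag constants are hypothetical names invented for illustration, and plain lists stand in for a filter chain.

```java
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical stand-in for a Token subclass that carries bit flags.
// This is NOT Lucene's Token; it only illustrates the idea.
class FlaggedToken {
    static final int PROTECTED = 1;        // "do not change"
    static final int SPLIT_PART = 1 << 1;  // split off from a larger token

    String termText;
    final int startOffset;
    final int endOffset;
    int flags; // an int gives 32 independent flags

    FlaggedToken(String termText, int startOffset, int endOffset) {
        this.termText = termText;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    void setFlag(int mask) { flags |= mask; }

    boolean hasFlag(int mask) { return (flags & mask) != 0; }
}

// Two independent steps, standing in for two filters in a chain.
class FlagFilters {
    // Marking step: tag tokens whose whole text matches the pattern.
    // The token text itself is left untouched.
    static void markMatching(List<FlaggedToken> tokens, Pattern pattern, int mask) {
        for (FlaggedToken t : tokens) {
            if (pattern.matcher(t.termText).matches()) {
                t.setFlag(mask);
            }
        }
    }

    // Transformation step: lowercase every token that is not protected.
    static void lowerCaseUnprotected(List<FlaggedToken> tokens) {
        for (FlaggedToken t : tokens) {
            if (!t.hasFlag(FlaggedToken.PROTECTED)) {
                t.termText = t.termText.toLowerCase(Locale.ROOT);
            }
        }
    }
}
```

Because the lowercasing step only looks at the flag, the rule for what counts as protected can change (a word list instead of a regex, say) without touching the transformation.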
> add Token.setTermText(), remove final
> -------------------------------------
>
>          Key: LUCENE-438
>          URL: http://issues.apache.org/jira/browse/LUCENE-438
>      Project: Lucene - Java
>         Type: Improvement
>     Versions: CVS Nightly - Specify date in submission
>     Reporter: Yonik Seeley
>     Priority: Minor
>  Attachments: yonik_Token.txt
>
> The Token class should be more friendly to classes not in its package:
>   1) add setTermText()
>   2) remove final from class and toString()
>   3) add clone()
> Support for (1):
>   TokenFilters in the same package as Token are able to do things like
>   "t.termText = t.termText.toLowerCase();" which is more efficient, but more importantly less error prone. Without the ability to change *only* the term text, a new Token must be created, and one must remember to set all the properties correctly. This exact issue caused this bug:
>   http://issues.apache.org/jira/browse/LUCENE-437
> Support for (2):
>   Removing final allows one to subclass Token. I didn't see any performance impact after removing final.
>   I can go into more detail on why I want to subclass Token if anyone is interested.
> Support for (3):
>   - support for a synonym TokenFilter, where one needs to make two tokens from one (same args that support (1), and especially important if the instance is a subclass of Token).
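
Points (1) and (3) of the quoted issue can be illustrated with a small standalone sketch. It does not use Lucene: SimpleToken and SynonymSketch are hypothetical names, and the fields merely resemble the kind of properties a token carries. The point is that clone() followed by setTermText() copies every other property automatically, so none can be forgotten when a second token is made from the first.

```java
// Hypothetical stand-in for a token class with a text setter and clone().
// This is NOT Lucene's Token; it only illustrates the argument in the issue.
class SimpleToken implements Cloneable {
    private String termText;
    private final int startOffset;
    private final int endOffset;
    private final String type;
    private int positionIncrement = 1;

    SimpleToken(String termText, int startOffset, int endOffset, String type) {
        this.termText = termText;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
        this.type = type;
    }

    String termText() { return termText; }
    int startOffset() { return startOffset; }
    int endOffset() { return endOffset; }
    String type() { return type; }
    int getPositionIncrement() { return positionIncrement; }

    // Change only the text; every other property stays as it was.
    void setTermText(String text) { this.termText = text; }

    void setPositionIncrement(int increment) { this.positionIncrement = increment; }

    // Field-by-field copy; a subclass instance is copied as that subclass.
    @Override
    public SimpleToken clone() {
        try {
            return (SimpleToken) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // unreachable: the class is Cloneable
        }
    }
}

class SynonymSketch {
    // Make a second token from the first: same offsets and type, new text,
    // stacked at the same position as the original (increment 0).
    static SimpleToken synonymOf(SimpleToken original, String synonym) {
        SimpleToken copy = original.clone();
        copy.setTermText(synonym);
        copy.setPositionIncrement(0);
        return copy;
    }
}
```

Since clone() goes through Object.clone(), any extra fields added by a subclass (such as flag bits) are carried over as well, which is the case the issue calls out as especially important.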