lucene-dev mailing list archives

From "Yonik Seeley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-438) add Token.setTermText(), remove final
Date Fri, 23 Sep 2005 02:56:28 GMT
    [ http://issues.apache.org/jira/browse/LUCENE-438?page=comments#action_12330250 ] 

Yonik Seeley commented on LUCENE-438:
-------------------------------------

Mostly to convey information across TokenFilters, and the single type string isn't sufficient.
For example, I'd like to have an int or long that can be used as 32 or 64 independent flags.

In general, having attributes you can dynamically attach to tokens allows you to decompose
token filters into more basic functions and thus gives greater power to filter chains.

Some use cases I can think of:
 - conditionals... mark tokens in one filter and conditionally act on them in another.

 - decouple the marking of tokens from the transformation of tokens... one could have a 
  TokenMatcherFilter that would tag certain tokens that matched a regex, for example.

 - protected tokens: mark certain words as "do not change", "do not stem", "do not lowercase"
for instance.

 - mark tokens that are split from a larger token (for example when a camelCase filter splits
"fooBar" into "foo Bar") so they may be treated differently by other filters
 
 - performance (hey it comes for free).  You can do things like StandardTokenFilter, which
checks the type of the token and doesn't have to re-parse every single token.
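
A minimal sketch of the "mark in one filter, act in another" idea from the list above. It deliberately
does not use Lucene: FlaggedToken, the flag constants, and both stage methods are hypothetical names
invented for illustration, and plain lists stand in for a real TokenFilter chain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical stand-in for a token; this is NOT Lucene's Token class.
class FlaggedToken {
    static final long PROTECTED  = 1L << 0; // "do not change"
    static final long SPLIT_PART = 1L << 1; // produced by splitting a larger token

    String text;
    long flags; // a long gives 64 independent bit flags

    FlaggedToken(String text) { this.text = text; }

    void set(long flag)    { flags |= flag; }
    boolean has(long flag) { return (flags & flag) != 0; }
}

class TokenFlagsDemo {
    // Stage 1: only marks tokens whose whole text matches the pattern; no transformation.
    static void markProtected(List<FlaggedToken> tokens, Pattern pattern) {
        for (FlaggedToken t : tokens) {
            if (pattern.matcher(t.text).matches()) {
                t.set(FlaggedToken.PROTECTED);
            }
        }
    }

    // Stage 2: lowercases every token that an earlier stage did not protect.
    static void lowerCaseUnprotected(List<FlaggedToken> tokens) {
        for (FlaggedToken t : tokens) {
            if (!t.has(FlaggedToken.PROTECTED)) {
                t.text = t.text.toLowerCase(Locale.ROOT);
            }
        }
    }

    // Runs both stages in order and returns the resulting token texts.
    static List<String> run(String... words) {
        List<FlaggedToken> tokens = new ArrayList<>();
        for (String w : words) {
            tokens.add(new FlaggedToken(w));
        }
        markProtected(tokens, Pattern.compile("[A-Z]{2,}")); // assumption: protect all-caps words
        lowerCaseUnprotected(tokens);

        List<String> out = new ArrayList<>();
        for (FlaggedToken t : tokens) {
            out.add(t.text);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run("The", "NASA", "Launch")); // prints [the, NASA, launch]
    }
}
```

The lowercasing stage needs no knowledge of the regex; it only reads the flag, which is the decoupling
described above.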

I've already had to implement TokenFilter functionality (protected tokens, token splitting
and combining) where I've had to stuff more functionality than I'd like into a single filter
because of the inability of one filter to provide more info to another.

So I think there's a strong case for being able to dynamically add attributes (set bit flags)
on a token.  I planned on subclassing Token to achieve that.  But because I don't know what
other people may need/want in the future, making it so one can provide extensions to Token
via inheritance seems like a good thing.
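
A rough sketch of what extension via inheritance could look like, assuming a non-final token class
with a term-text setter and clone(). SimpleToken and FlagToken are hypothetical classes written for
this example; they are not Lucene's Token or the attached patch.

```java
// Hypothetical, simplified token; this is NOT Lucene's actual Token class.
class SimpleToken implements Cloneable {
    private String termText;
    private final int startOffset;
    private final int endOffset;
    private int positionIncrement = 1;

    SimpleToken(String termText, int startOffset, int endOffset) {
        this.termText = termText;
        this.startOffset = startOffset;
        this.endOffset = endOffset;
    }

    String termText()            { return termText; }
    void setTermText(String s)   { termText = s; } // change only the text, keep everything else
    int startOffset()            { return startOffset; }
    int endOffset()              { return endOffset; }
    int positionIncrement()      { return positionIncrement; }
    void setPositionIncrement(int inc) { positionIncrement = inc; }

    @Override
    public SimpleToken clone() {
        try {
            // Object.clone() copies all fields and keeps the runtime class,
            // so state added by subclasses is preserved as well.
            return (SimpleToken) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: the class is Cloneable
        }
    }
}

// Subclass that attaches 64 bit flags to a token.
class FlagToken extends SimpleToken {
    private long flags;

    FlagToken(String termText, int startOffset, int endOffset) {
        super(termText, startOffset, endOffset);
    }

    long flags()          { return flags; }
    void setFlags(long f) { flags = f; }

    @Override
    public FlagToken clone() { return (FlagToken) super.clone(); }
}

class TokenCloneDemo {
    // Synonym-style helper: copy the token, then change only the term text.
    // A position increment of 0 places the copy at the same position as the original.
    static SimpleToken synonymOf(SimpleToken original, String synonym) {
        SimpleToken copy = original.clone();
        copy.setTermText(synonym);
        copy.setPositionIncrement(0);
        return copy;
    }

    public static void main(String[] args) {
        FlagToken fast = new FlagToken("fast", 0, 4);
        fast.setFlags(1L);
        SimpleToken quick = synonymOf(fast, "quick");
        System.out.println(quick.termText() + " " + (quick instanceof FlagToken)); // prints quick true
    }
}
```

Because the helper clones rather than constructs, it never has to know which subclass it was handed
or which extra fields that subclass carries.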



> add Token.setTermText(), remove final
> -------------------------------------
>
>          Key: LUCENE-438
>          URL: http://issues.apache.org/jira/browse/LUCENE-438
>      Project: Lucene - Java
>         Type: Improvement
>     Versions: CVS Nightly - Specify date in submission
>     Reporter: Yonik Seeley
>     Priority: Minor
>  Attachments: yonik_Token.txt
>
> The Token class should be more friendly to classes not in its package:
>   1) add setTermText()
>   2) remove final from class and toString()
>   3) add clone()
> Support for (1):
>   TokenFilters in the same package as Token are able to do things like 
>    "t.termText = t.termText.toLowerCase();" which is more efficient, but more importantly
> less error prone.  Without the ability to change *only* the term text, a new Token must be
> created, and one must remember to set all the properties correctly.  This exact issue caused
> this bug:
> http://issues.apache.org/jira/browse/LUCENE-437
> Support for (2):
>   Removing final allows one to subclass Token.  I didn't see any performance impact after
> removing final.
> I can go into more detail on why I want to subclass Token if anyone is interested.
> Support for (3):
>   - support for a synonym TokenFilter, where one needs to make two tokens from one (same
> args that support (1), and especially important if the instance is a subclass of Token).

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

