incubator-lucy-dev mailing list archives

From Peter Karman <pe...@peknet.com>
Subject Re: [lucy-dev] Extending the StandardTokenizer
Date Fri, 24 Feb 2012 03:20:39 GMT
Marvin Humphrey wrote on 2/23/12 5:27 PM:

> 
> In the meantime, if you want to commit WordTokenizer, I won't object.  FWIW, I
> believe that the analogous Lucene class is called "LetterTokenizer", so you
> might consider renaming it.
> 
>   http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/analysis/LetterTokenizer.html
> 
> However, I hope we can manage to create an extension mechanism before the
> release of Lucy 0.4.0 which allows the user to code up the equivalent of
> WordTokenizer as a user-space subclass, and that you won't object to
> removing WordTokenizer before it escapes into the wild in that case.
> 

+1 to Marvin's hope.

-0 to committing the code, and -1 to the name WordTokenizer. Maybe this is
bikeshedding, but that name just seems misleading. Does the StandardTokenizer not
also tokenize words? The term 'word' is just too overloaded here.
LetterTokenizer is slightly better, but I share Marvin's hope that we can find a
way to get the performance love at the host language subclass level without
needing to support multiple variations on a theme in the core dist.

Let's figure out a way to avoid the naming conversation altogether and extend
the StandardTokenizer to do what you need, Nick.


-- 
Peter Karman  .  http://peknet.com/  .  peter@peknet.com
