lucene-java-user mailing list archives

From Erik Hatcher
Subject Re: Splitting of words
Date Thu, 22 Sep 2005 12:50:07 GMT

On Sep 22, 2005, at 4:36 AM, Endre Stølsvik wrote:

> | The StandardTokenizer is the most sophisticated one built into Lucene.
> | You can see the types of tokens it emits by looking at the javadoc here:
> |    < analysis/standard/StandardTokenizer.html>
> |
> | It recognizes e-mail addresses, interior apostrophe words (like o'clock),
> | hostnames/IP addresses (like, acronyms, and CJK characters.
>
> It would be great if it also separated "UpperCamelCase" and
> "lowerCamelCase" words into both the individual words and the one long
> word. A run of several uppercase letters followed by lowercase would
> most probably be best handled like HTTPUnit -> http unit.
>   This is of course due, for my part, to Java language influence. But I
> believe it is customary in many programming languages to use
> lowerCamelCase for e.g. variables. Filenames too.

I strongly disagree.  It would not be good at all for
StandardTokenizer to do this.  It would be easy to write a
CamelCaseSplitFilter that could be used in conjunction with any
tokenizer.

It is important to design filters and tokenizers to be as
single-purpose as possible, so that they can be combined for
various scenarios.
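
To make the idea concrete, here is a minimal sketch of the splitting
rule such a filter might apply. It is plain Java and deliberately not
wired into Lucene's TokenFilter API; the class name CamelCaseSplitter
and its methods split and tokens are hypothetical, chosen only for
illustration. It assumes the rules from the quoted request: break where
a non-uppercase character is followed by an uppercase one, break before
the last letter of an uppercase run when a lowercase letter follows
(HTTPUnit -> http unit), and emit the one long word alongside the parts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper, not part of Lucene: splits a camel-case word into parts. */
final class CamelCaseSplitter {

    private CamelCaseSplitter() {}

    /** Returns the lowercased parts of a word; a word with no boundary yields one part. */
    static List<String> split(String word) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < word.length(); i++) {
            char prev = word.charAt(i - 1);
            char cur = word.charAt(i);
            // Boundary when a non-uppercase character is followed by an uppercase one.
            boolean lowerToUpper = !Character.isUpperCase(prev) && Character.isUpperCase(cur);
            // Boundary at the end of an uppercase run: "HTTPUnit" breaks before 'U'
            // because the character after 'U' is lowercase.
            boolean endOfUpperRun = Character.isUpperCase(prev) && Character.isUpperCase(cur)
                    && i + 1 < word.length() && Character.isLowerCase(word.charAt(i + 1));
            if (lowerToUpper || endOfUpperRun) {
                parts.add(word.substring(start, i).toLowerCase(Locale.ROOT));
                start = i;
            }
        }
        if (start < word.length()) {
            parts.add(word.substring(start).toLowerCase(Locale.ROOT));
        }
        return parts;
    }

    /** Emits the whole word first, followed by its parts if it split into several. */
    static List<String> tokens(String word) {
        List<String> out = new ArrayList<>();
        out.add(word.toLowerCase(Locale.ROOT));
        List<String> parts = split(word);
        if (parts.size() > 1) {
            out.addAll(parts);
        }
        return out;
    }
}
```

A real filter would apply this per token coming out of whatever
tokenizer precedes it, which is what keeps the splitting logic separate
from StandardTokenizer itself.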

If such a filter is contributed, I'd happily add it to
contrib/analyzers; it seems useful to have around.

