lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Splitting of words
Date Thu, 22 Sep 2005 12:50:07 GMT

On Sep 22, 2005, at 4:36 AM, Endre Stølsvik wrote:

>
> | The StandardTokenizer is the most sophisticated one built into Lucene. You
> | can see the types of tokens it emits by looking at the javadoc here:
> |    <http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/standard/StandardTokenizer.html>
> |
> | It recognizes e-mail addresses, interior apostrophe words (like o'clock),
> | hostnames/IP addresses (like lucene.apache.org), acronyms, and CJK characters.
>
> It would be great if it also separated "UpperCamelCase" and
> "lowerCamelCase" words into both the different words, and one long word.
> Several uppercase, followed by lowercase, would most probably be best done
> like HTTPUnit -> http unit.
>   This is of course due to, for my part, java language influence. But I
> believe it is custom in many programming languages to use lowerCamelCase
> for e.g. variables. Filenames too.

I strongly disagree.  It would not be good at all for
StandardTokenizer to do this.  It would be easy to write a
CamelCaseSplitFilter that could be used in conjunction with any
tokenizer.

It is important to design filters and tokenizers in the most
single-purpose way to allow them to be combined for various scenarios.
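
As a rough illustration of the splitting such a filter would perform, here
is a minimal sketch in plain Java. It is not Lucene code: the class name
CamelCaseSplitSketch and its split method are hypothetical, and the choice
to lowercase the parts and to append the whole word last are assumptions
based on the "HTTPUnit -> http unit" example above. A real
CamelCaseSplitFilter would wrap this logic in a Lucene TokenFilter and
would also have to decide on offsets and position increments for the
emitted tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: shows only the word-splitting logic.
public class CamelCaseSplitSketch {

    // Split between a lowercase letter or digit and an uppercase letter
    // ("lower|Camel"), and before the last capital of an uppercase run that
    // is followed by a lowercase letter ("HTTP|Unit").
    private static final Pattern BOUNDARY =
        Pattern.compile("(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])");

    // Returns the lowercased parts, followed by the lowercased whole word
    // when the token was actually split (an assumed output order).
    public static List<String> split(String token) {
        List<String> out = new ArrayList<>();
        String[] parts = BOUNDARY.split(token);
        for (String part : parts) {
            out.add(part.toLowerCase(Locale.ROOT));
        }
        if (parts.length > 1) {
            out.add(token.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("HTTPUnit"));       // [http, unit, httpunit]
        System.out.println(split("lowerCamelCase")); // [lower, camel, case, lowercamelcase]
        System.out.println(split("lucene"));         // [lucene]
    }
}
```

Keeping this logic out of the tokenizer means it can be applied after
whichever tokenizer has already produced the tokens.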

If such a filter is contributed, I'd happily add it to
contrib/analyzers - seems useful to have around.

     Erik



