lucene-java-user mailing list archives

From Endre Stølsvik <En...@Stolsvik.com>
Subject Re: Splitting of words
Date Thu, 22 Sep 2005 08:36:29 GMT

| The StandardTokenizer is the most sophisticated one built into Lucene.  You
| can see the types of tokens it emits by looking at the javadoc here:
|    <http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/standard/StandardTokenizer.html>
| 
| It recognizes e-mail addresses, interior apostrophe words (like o'clock),
| hostnames/IP addresses (like lucene.apache.org), acronyms, and CJK characters.

It would be great if it also split "UpperCamelCase" and 
"lowerCamelCase" words, emitting both the individual words and the whole 
word as one token. A run of uppercase letters followed by lowercase would 
most probably be best handled like HTTPUnit -> http unit.
  This is of course due to, for my part, the influence of the Java 
language. But I believe it is customary in many programming languages to 
use lowerCamelCase for e.g. variables. Filenames too.
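
A minimal sketch of the kind of splitting described above, in plain Java 
rather than as a Lucene TokenFilter. The class name CamelCaseSplitter and 
its split method are hypothetical, not part of Lucene, and the boundary 
rule only handles ASCII letters and digits:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not a Lucene class: splits a camel-case word into
// the whole word plus its parts, all lowercased.
public class CamelCaseSplitter {

    // A boundary sits between a lowercase letter or digit and an uppercase
    // letter ("camel|Case"), or between a run of uppercase letters and a
    // capitalized word ("HTTP|Unit"). ASCII only, by assumption.
    private static final String BOUNDARY =
        "(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])";

    public static List<String> split(String word) {
        List<String> tokens = new ArrayList<>();
        // The whole word always comes first, as "one long word".
        tokens.add(word.toLowerCase(Locale.ROOT));
        String[] parts = word.split(BOUNDARY);
        // Only add the parts when the word actually contained a boundary.
        if (parts.length > 1) {
            for (String part : parts) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("HTTPUnit"));       // [httpunit, http, unit]
        System.out.println(split("lowerCamelCase")); // [lowercamelcase, lower, camel, case]
    }
}
```

Inside an analyzer, the parts would presumably be emitted at the same 
position as the whole word, so that a search for either "httpunit" or 
"unit" finds the document. That wiring is left out here.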

Regards,
Endre.
