lucene-java-user mailing list archives

From Endre Stølsvik <En...@Stolsvik.com>
Subject Re: Splitting of words
Date Tue, 27 Sep 2005 10:29:49 GMT
On Thu, 22 Sep 2005, Erik Hatcher wrote:

| 
| On Sep 22, 2005, at 4:36 AM, Endre Stølsvik wrote:
| 
| > 
| > | The StandardTokenizer is the most sophisticated one built into Lucene. You
| > | can see the types of tokens it emits by looking at the javadoc here:
| > |
| > | <http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/standard/StandardTokenizer.html>
| > |
| > | It recognizes e-mail addresses, interior apostrophe words (like o'clock),
| > | hostnames/IP addresses (like lucene.apache.org), acronyms, and CJK characters.
| > 
| > It would be great if it also split "UpperCamelCase" and "lowerCamelCase"
| > words, emitting both the individual words and the whole word as one token.
| > A run of several uppercase letters followed by lowercase would most
| > probably be best handled like HTTPUnit -> http unit.
| > This is of course due to, for my part, Java language influence. But I
| > believe it is customary in many programming languages to use lowerCamelCase
| > for e.g. variables. Filenames too.
| 
| I strongly disagree.  It would not be good at all for StandardTokenizer to do
| this. 

...

|
| It is important to design filters and tokenizers in the most single-purpose
| way to allow them to be combined for various scenarios.

Okay, but why? What is the reasoning behind this, and what is the logic 
behind the StandardTokenizer as it stands? (Note: There are strong 
reasons to believe that I'm just not quite up to speed on how this all 
fits together!)

| It would be easy to write a CamelCaseSplitFilter that could be used in 
| conjunction with any tokenizer.

Thanks for the tip!
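
For reference, here is a minimal sketch of the splitting logic such a filter
might apply to each token, following the HTTPUnit -> http unit example from
earlier in the thread. It is plain Java and deliberately does not touch the
Lucene API: the class name CamelCaseSplitter and its split method are
hypothetical, and wiring this into an actual TokenFilter is left out. It
assumes a word boundary at every lowercase-to-uppercase transition and before
the last letter of an uppercase run that is followed by a lowercase letter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: splits one camel-case token.
public class CamelCaseSplitter {

    // Returns the lowercased parts of the word; if the word was actually
    // split, the whole lowercased word is appended as a final entry.
    public static List<String> split(String word) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i < word.length(); i++) {
            char prev = word.charAt(i - 1);
            char cur = word.charAt(i);
            // Boundary such as "lower|Camel".
            boolean lowerToUpper =
                    Character.isLowerCase(prev) && Character.isUpperCase(cur);
            // Boundary such as "HTTP|Unit": the last uppercase letter of a
            // run starts the next word when a lowercase letter follows it.
            boolean acronymEnd =
                    Character.isUpperCase(prev) && Character.isUpperCase(cur)
                    && i + 1 < word.length()
                    && Character.isLowerCase(word.charAt(i + 1));
            if (lowerToUpper || acronymEnd) {
                parts.add(word.substring(start, i).toLowerCase(Locale.ROOT));
                start = i;
            }
        }
        parts.add(word.substring(start).toLowerCase(Locale.ROOT));
        if (parts.size() > 1) {
            // Also keep the "one long word" form.
            parts.add(word.toLowerCase(Locale.ROOT));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("HTTPUnit"));       // [http, unit, httpunit]
        System.out.println(split("lowerCamelCase")); // [lower, camel, case, lowercamelcase]
    }
}
```

Keeping this logic in its own filter, rather than inside the tokenizer, is
what lets it be combined with any tokenizer as suggested above.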

Regards,
Endre
