lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Splitting of words
Date Tue, 27 Sep 2005 16:44:15 GMT

On Sep 27, 2005, at 6:29 AM, Endre Stølsvik wrote:

> On Thu, 22 Sep 2005, Erik Hatcher wrote:
>
> |
> | On Sep 22, 2005, at 4:36 AM, Endre Stølsvik wrote:
> |
> | >
> | > | The StandardTokenizer is the most sophisticated one built into Lucene.
> | > | You can see the types of tokens it emits by looking at the javadoc here:
> | > |
> | > | <http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/standard/StandardTokenizer.html>
> | > |
> | > | It recognizes e-mail addresses, interior apostrophe words (like o'clock),
> | > | hostnames/IP addresses (like lucene.apache.org), acronyms, and CJK
> | > | characters.
> | >
> | > It would be great if it also separated "UpperCamelCase" and "lowerCamelCase"
> | > words into both the separate words and one long word. Several uppercase
> | > letters followed by lowercase would most probably be best handled like
> | > HTTPUnit -> http unit.
> | > This is of course due to, for my part, Java language influence. But I
> | > believe it is customary in many programming languages to use lowerCamelCase
> | > for e.g. variables. Filenames too.
> |
> | I strongly disagree.  It would not be good at all for StandardTokenizer
> | to do this.
>
> ...
>
> |
> | It is important to design filters and tokenizers in the most single-purpose
> | way to allow them to be combined for various scenarios.
>
> Okay. Why? Just wondering what the reasoning behind this is? What is the
> logic behind the StandardTokenizer as it stands? (Note: There are strong
> reasons to believe that I'm just not quite up to speed on how this all
> fits together..!)

The StandardTokenizer is a general-purpose tokenizer designed to split text
not just at whitespace boundaries, but also to keep CJK characters separate,
keep e-mail addresses together as a unit, and deal with common things like
part numbers, where alphabetic and numeric characters are mixed.  It's a good
tokenizer to start with and evolve from there.
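
For instance, a minimal sketch (assuming the Lucene 1.4-era API of this
period, where TokenStream.next() returns Token objects) that runs
StandardTokenizer over a sample string and prints each token with its type
would look something like this:

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class ShowStandardTokens {
    public static void main(String[] args) throws Exception {
        // sample text mixing an e-mail address, a hostname, and a part number
        String text = "Mail someone@example.com about lucene.apache.org or part XL5000";
        StandardTokenizer tokenizer =
            new StandardTokenizer(new StringReader(text));
        for (Token t = tokenizer.next(); t != null; t = tokenizer.next()) {
            // termText() is the token text; type() is the grammar rule that
            // matched it, e.g. <ALPHANUM>, <EMAIL>, <HOST>, <NUM>
            System.out.println(t.termText() + "\t" + t.type());
        }
        tokenizer.close();
    }
}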

The StandardTokenizer does more than certain scenarios demand and less than
others require - it's a nice happy medium.
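
To illustrate the composition point from the quoted exchange: camelCase
splitting can live in its own filter layered on top of StandardTokenizer
rather than inside it.  The class below is a hypothetical sketch against the
same Lucene 1.4-era TokenFilter API, not an existing Lucene class:

import java.io.IOException;
import java.util.LinkedList;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/**
 * Hypothetical filter: passes each token through unchanged and, when the
 * token looks like camelCase, also emits its lower-cased parts at the same
 * position, so both "HTTPUnit" and "http"/"unit" are searchable.
 */
public class CamelCaseSplitFilter extends TokenFilter {
    // parts of the current camelCase token still waiting to be emitted
    private final LinkedList pending = new LinkedList();

    public CamelCaseSplitFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        if (!pending.isEmpty()) {
            return (Token) pending.removeFirst();
        }
        Token token = input.next();
        if (token == null) {
            return null;
        }
        // split at lower-to-upper transitions and before the last capital of
        // an all-caps run, so "HTTPUnit" -> "HTTP", "Unit"
        String[] parts = token.termText().split(
            "(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])");
        if (parts.length > 1) {
            for (int i = 0; i < parts.length; i++) {
                // offsets are kept on the whole word for simplicity
                Token part = new Token(parts[i].toLowerCase(),
                        token.startOffset(), token.endOffset(), token.type());
                // position increment 0 stacks each part on the original token
                part.setPositionIncrement(0);
                pending.add(part);
            }
        }
        return token;
    }
}

An analyzer that wants this behavior would simply chain it after the
tokenizer, e.g. new LowerCaseFilter(new CamelCaseSplitFilter(new
StandardTokenizer(reader))), leaving StandardTokenizer itself untouched.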

     Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

