lucene-java-user mailing list archives

From Erik Hatcher
Subject Re: Splitting of words
Date Tue, 13 Sep 2005 13:00:49 GMT

On Sep 13, 2005, at 7:24 AM, Madhu Satyanarayana Panitini wrote:

> Hi Paul,
> I agree with you: "Analyzer is the magic word".
> Let's look at it in depth. I would consider three parts in the
> analyzer:
> 1. Tokenization (splitting of words)
> 2. Stopword removal (depends on the language)
> 3. Stemming of the words (depends on the language)
> To start the analysis we have to split the text. For example, I would
> like to split the text wherever I find the following non-alphabetic
> characters:
> "\s+|;|:|<|>|\^|~|=|--+|\+|\?|!|&|\$|@|\#|\'|`|"|_|\%|\*|,|\."
> That means I would like to split the text wherever I find
> space, :, ;, ", ', <, >, ?, etc.
> Then we remove the stopwords, and then stemming follows.
> So my question is: how does Lucene split the text? Only where it
> encounters a space between words, or does it consider non-alphabetic
> characters as well?
> What is the full grammar StandardAnalyzer uses to split words?

Madhu - you'd do well to try out the AnalyzerDemo that comes with the  
"Lucene in Action" code.  You can download it from http:// - here's an example run:

$ ant AnalyzerDemo


      [echo]       Demonstrates analysis of sample text.
      [echo]       Refer to the "Analysis" chapter for much more on this
      [echo]       extremely crucial topic.
     [input] Press return to continue...

     [input] String to analyze: [This string will be analyzed.]

      [echo] Running lia.analysis.AnalyzerDemo...
      [java] Analyzing "This string will be analyzed."
      [java]   WhitespaceAnalyzer:
      [java]     [This] [string] [will] [be] [analyzed.]

      [java]   SimpleAnalyzer:
      [java]     [this] [string] [will] [be] [analyzed]

      [java]   StopAnalyzer:
      [java]     [string] [analyzed]

      [java]   StandardAnalyzer:
      [java]     [this] [string] [will] [be] [analyzed]

      [java]   SnowballAnalyzer:
      [java]     [this] [string] [will] [be] [analyz]

      [java]   SnowballAnalyzer:
      [java]     [this] [string] [wil] [be] [analyzed]

      [java]   SnowballAnalyzer:
      [java]     [thi] [string] [will] [be] [analyz]

Total time: 13 seconds
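
To make the first three results above concrete, here is a rough plain-Java
sketch of the behavior. It does not use Lucene at all and is not Lucene's
code; the class name, method names, and the three-word stop list are
assumptions made up for illustration, chosen only so this one sentence
comes out the same way as in the demo run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for three of the analyzers shown in the demo output.
final class AnalyzerSketch {
    // Illustrative subset only, NOT the real StopAnalyzer word list.
    static final Set<String> STOP_WORDS = Set.of("this", "will", "be");

    // Approximates WhitespaceAnalyzer: split on whitespace, keep punctuation and case.
    static List<String> whitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) tokens.add(piece);
        }
        return tokens;
    }

    // Approximates SimpleAnalyzer: split at every non-letter, then lowercase.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) tokens.add(piece.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Approximates StopAnalyzer: the simple tokens minus the stop words.
    static List<String> stop(String text) {
        List<String> tokens = new ArrayList<>(simple(text));
        tokens.removeIf(STOP_WORDS::contains);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "This string will be analyzed.";
        System.out.println(whitespace(text)); // [This, string, will, be, analyzed.]
        System.out.println(simple(text));     // [this, string, will, be, analyzed]
        System.out.println(stop(text));       // [string, analyzed]
    }
}
```

So splitting is not tied to spaces: only the whitespace variant keeps the
trailing period attached to "analyzed".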

The StandardTokenizer is the most sophisticated one built into  
Lucene.  You can see the types of tokens it emits by looking at the  
javadoc here:

It recognizes e-mail addresses, interior apostrophe words (like
o'clock), hostnames/IP addresses, acronyms,
and CJK characters.
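
For comparison, here is what the delimiter pattern from your message does
when applied directly with java.util.regex. This is only a sketch in plain
Java, not a Lucene Tokenizer; the class and method names are hypothetical,
and the sample sentence is made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper that applies the quoted delimiter pattern as-is.
final class RegexSplitSketch {
    // The pattern from the quoted message, escaped for a Java string literal.
    static final Pattern DELIMITERS = Pattern.compile(
        "\\s+|;|:|<|>|\\^|~|=|--+|\\+|\\?|!|&|\\$|@|\\#|\\'|`|\"|_|\\%|\\*|,|\\.");

    static List<String> split(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : DELIMITERS.split(text)) {
            // Adjacent delimiters such as ", " produce empty pieces; drop them.
            if (!piece.isEmpty()) tokens.add(piece);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("Mail user@example.com at 5 o'clock!"));
        // prints [Mail, user, example, com, at, 5, o, clock]
    }
}
```

Note that this pattern breaks the e-mail address and "o'clock" into
pieces, which are exactly the cases listed above that StandardTokenizer
keeps together.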

