lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Splitting of words
Date Tue, 13 Sep 2005 13:00:49 GMT

On Sep 13, 2005, at 7:24 AM, Madhu Satyanarayana Panitini wrote:

> Hi Paul,
>
> I agree with you: "Analyzer is the magic word".
> Let's look at it in depth. I would consider three parts in the
> analyzer:
>
> 1. Tokenization (splitting of words)
> 2. Stopword removal (depends on the language)
> 3. Stemming of the words (depends on the language)
>
> To start the analysis we have to split the text. For example, I would
> like to split the text wherever I find one of the following
> non-alphabetic characters:
> "\s+|;|:|<|>|\^|~|=|--+|\+|\?|!|&|\$|@|\#|\'|`|"|_|\%|\*|,|\."
> That means I would like to split the text wherever I find
> space, :, ;, ", ', <, >, ?, etc.
>
> Then we remove the stopwords, and after that the stemming is applied.
>
> So my question is: how does Lucene split the text? Only when it
> encounters a space between words, or does it treat non-alphabetic
> characters as separators as well?
>
> What is the whole grammar StandardAnalyzer uses to split the words?
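
For illustration, the three stages listed in the question can be sketched
in plain Java. This is not Lucene code and makes no claim about how any
Lucene analyzer is implemented: the class name PipelineSketch, the tiny
stopword list, and the naive suffix-stripping "stemmer" are all assumptions
made up for this example. Only the split pattern comes from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Plain-Java illustration of tokenize -> remove stopwords -> stem.
// Hypothetical example class; this is NOT how Lucene does it.
class PipelineSketch {
    // The split pattern from the question: whitespace plus listed punctuation.
    static final Pattern SPLIT = Pattern.compile(
        "\\s+|;|:|<|>|\\^|~|=|--+|\\+|\\?|!|&|\\$|@|#|'|`|\"|_|%|\\*|,|\\.");

    // Tiny stopword list chosen for the example (an assumption).
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("this", "will", "be", "the", "a", "an"));

    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String raw : SPLIT.split(text)) {
            if (raw.isEmpty()) continue;              // adjacent delimiters give empty strings
            String token = raw.toLowerCase(Locale.ROOT);
            if (STOP_WORDS.contains(token)) continue; // stage 2: stopword removal
            out.add(stem(token));                     // stage 3: stemming
        }
        return out;
    }

    // Deliberately naive suffix stripping, standing in for a real stemmer.
    static String stem(String token) {
        if (token.length() > 4 && token.endsWith("ed")) {
            return token.substring(0, token.length() - 2);
        }
        if (token.length() > 4 && token.endsWith("s")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    public static void main(String[] args) {
        System.out.println(analyze("This string will be analyzed."));
    }
}
```

Note that the pattern only splits on runs of two or more hyphens, so a
single hyphen inside a word is kept.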

Madhu - you'd do well to try out the AnalyzerDemo that comes with the
"Lucene in Action" code.  You can download it from
http://www.lucenebook.com - here's an example run:

$ ant AnalyzerDemo

     ...

AnalyzerDemo:
      [echo]
      [echo]       Demonstrates analysis of sample text.
      [echo]
      [echo]       Refer to the "Analysis" chapter for much more on this
      [echo]       extremely crucial topic.
      [echo]
     [input] Press return to continue...

     [input] String to analyze: [This string will be analyzed.]

      [echo] Running lia.analysis.AnalyzerDemo...
      [java] Analyzing "This string will be analyzed."
      [java]   WhitespaceAnalyzer:
      [java]     [This] [string] [will] [be] [analyzed.]

      [java]   SimpleAnalyzer:
      [java]     [this] [string] [will] [be] [analyzed]

      [java]   StopAnalyzer:
      [java]     [string] [analyzed]

      [java]   StandardAnalyzer:
      [java]     [this] [string] [will] [be] [analyzed]

      [java]   SnowballAnalyzer:
      [java]     [this] [string] [will] [be] [analyz]

      [java]   SnowballAnalyzer:
      [java]     [this] [string] [wil] [be] [analyzed]

      [java]   SnowballAnalyzer:
      [java]     [thi] [string] [will] [be] [analyz]


BUILD SUCCESSFUL
Total time: 13 seconds
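
The first two results in the run above differ because one analyzer breaks
only at whitespace while the other breaks at every non-letter and
lowercases the result. A rough plain-Java sketch of those two behaviours
follows. It uses no Lucene classes, and the class and method names
(TokenizerSketch, whitespaceTokens, lowercasedLetterTokens) are
hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class imitating two splitting strategies in plain Java.
class TokenizerSketch {
    // Breaks only at whitespace, so punctuation stays attached to the word.
    static List<String> whitespaceTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // Keeps runs of letters, treats every other character as a break,
    // and lowercases each letter as it goes.
    static List<String> lowercasedLetterTokens(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                out.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) out.add(current.toString());
        return out;
    }

    public static void main(String[] args) {
        String text = "This string will be analyzed.";
        System.out.println(whitespaceTokens(text));
        System.out.println(lowercasedLetterTokens(text));
    }
}
```

For the sample string, the first method keeps "analyzed." with its
trailing period and the second yields "analyzed", which mirrors the
WhitespaceAnalyzer and SimpleAnalyzer lines in the demo output.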

The StandardTokenizer is the most sophisticated one built into  
Lucene.  You can see the types of tokens it emits by looking at the  
javadoc here:
     <http://lucene.apache.org/java/docs/api/org/apache/lucene/analysis/standard/StandardTokenizer.html>

It recognizes e-mail addresses, interior apostrophe words (like  
o'clock), hostnames/IP addresses (like lucene.apache.org), acronyms,  
and CJK characters.

     Erik



