lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Madhu Satyanarayana Panitini" <Madhu.Panit...@pass-consulting.com>
Subject RE: Splitting of words
Date Tue, 13 Sep 2005 11:24:55 GMT
Hi Paul,

I agree with u "Analyzer is the magic word"
Lets look it in depth and clear, I would consider three parts in the
analyzer 

1. Tokenization (splitting of words)
2. Stopwords removal (depends up on the language)
3. stemming of the words (depends up on the language) 

First to start analyze we have split the text, for example I like split
the text wherever I find the following non alphabets 
"\s+|;|:|<|>|\^|~|=|--+|\+|\?|!|&|\$|@|\#|\'|`|"|_|\%|\*|,|\." 
That means I would like to split the text wherever I find
space,:,;,",',<,>,?,  etc....

And then we remove the stopwords and then stemming goes on.

Coming my question is clear now how Lucene splits the text? only when
ever it encounter the space between the words or it consider the non
alphabetic characters as well.

What is the whole grammar Standard analyzer has to split the words ?

Madhu






Madhu Satyanarayana. Panitini
PASS GCA Solution Centre Pvt Ltd.
601 Aditya Trade Centre, Ameerpet, 
Hyderabad, India. 
www.pass-consulting.com 



-----Original Message-----
From: Paul Libbrecht [mailto:paul@activemath.org] 
Sent: Tuesday, September 13, 2005 3:40 PM
To: java-user@lucene.apache.org
Subject: Re: Spliting of words

Madhu,

Analyzer is the magic word here.

Lucene's StandardAnalyzer has a whole grammar to split words into 
tokens. There are many more analyzers, most of which are language 
specific (e.g. based the Snowball or Porter-stemmers, see contribs or 
javadoc of core).

For which language do wish to use that ?

paul


Le 13 sept. 05, à 11:45, Madhu Satyanarayana Panitini a écrit :

> Hai all
>
> I want know the split pattern of text before indexing in Lucene, its
> splits where ever there is space in between the words Or is there any
> pattern in splitting the words of text document. In which program I
can
> find the code on the splitting of the word.
>
> Madhu
>
> Madhu Satyanarayana. Panitini
> PASS GCA Solution Centre Pvt Ltd.
> 601 Aditya Trade Centre, Ameerpet,
> Hyderabad, India.
> www.pass-consulting.com
>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message