lucene-java-user mailing list archives

From karl wettin <>
Subject Re: Stop words (how to create ideal set of stop words?)
Date Thu, 10 May 2007 20:57:33 GMT

On 10 May 2007 at 20:39, Lukas Vlcek wrote:

> Can anybody point me to some references on how to create an ideal set
> of stop words? I know that this is more of a theoretical question, but
> how do Luceners determine which words should be excluded when creating
> Analyzers for a new language?

The idea with stop words is to keep the index as small as possible
without major loss of features, so they ought to be frequently
occurring words with little or no semantic meaning. Which words those
are depends on the language, the corpus, etc.

> And which technique was used for validation of stop
> word lists in current Analyzers?

My guess is that they were manually chosen from a corpus term
frequency vector.
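
To make that guess concrete, here is a minimal sketch of how such a
frequency list could be produced for manual review. It is not how the
Lucene Analyzers were actually built; the function name
stop_word_candidates, the top_n parameter and the naive \w+ tokenizer
are all assumptions made for illustration.

```python
import re
from collections import Counter


def stop_word_candidates(documents, top_n=20):
    """Rank terms by how many documents they occur in.

    Returns a list of (term, document_frequency_ratio, total_count)
    tuples. The output is only a shortlist for a human to review,
    not a finished stop word list.
    """
    doc_freq = Counter()   # number of documents containing the term
    term_freq = Counter()  # total occurrences across the corpus
    for text in documents:
        # Naive tokenizer (assumption): lowercase runs of word characters.
        tokens = re.findall(r"\w+", text.lower())
        term_freq.update(tokens)
        doc_freq.update(set(tokens))

    n_docs = len(documents)
    # Most widespread terms first; ties broken by total count, then by term.
    ranked = sorted(doc_freq, key=lambda t: (-doc_freq[t], -term_freq[t], t))
    return [(t, doc_freq[t] / n_docs, term_freq[t]) for t in ranked[:top_n]]
```

Terms near the top that carry no meaning for your users are the
candidates; terms that are frequent but meaningful in the corpus
should be left in the index.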

> More specifically, I am interested in situations where there is a need
> to build a search engine around a specific corpus (for example, when we
> need to search a set of articles related to programming languages
> only). Given a specific corpus, is there any recommended technique for
> stop word derivation?

If you have no knowledge of the language for which you wish to produce
stop words, then it will be fairly hard to know what to consider a
stop word. You might be able to treat it as a text classification
problem. Feature/attribute selection for classifiers is a
well-researched subject. Weka, Yale, R, etc. are all tools that might
help you. But I honestly think that no matter how you twist and turn
the data, manually choosing the stop words is the way to go.
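
If you do want to try the classification angle to get a shortlist for
that manual pass, one possible sketch is below. It assumes you already
have documents tagged with a category and scores each term by the
normalized entropy of its distribution over the categories: a score
near 1.0 means the term is spread evenly and tells a classifier
little, a score near 0 means it is concentrated in one category. This
particular measure, the function name class_entropy_scores and the
min_count parameter are my own choices for illustration, not something
taken from Weka, Yale or R.

```python
import math
from collections import Counter, defaultdict


def class_entropy_scores(labeled_docs, min_count=5):
    """Score terms by how evenly they spread over the categories.

    labeled_docs: iterable of (label, list_of_tokens) pairs.
    Returns {term: score} with score in [0, 1]; terms seen fewer than
    min_count times are skipped because their estimate is unreliable.
    """
    per_term = defaultdict(Counter)  # term -> label -> count
    label_totals = Counter()         # label -> total tokens in that label
    for label, tokens in labeled_docs:
        for tok in tokens:
            per_term[tok][label] += 1
            label_totals[label] += 1

    n_labels = len(label_totals)
    scores = {}
    if n_labels < 2:
        return scores  # entropy over a single category is meaningless

    for term, counts in per_term.items():
        if sum(counts.values()) < min_count:
            continue
        # Relative frequency per category, so large categories do not dominate.
        rates = [counts[label] / label_totals[label] for label in label_totals]
        z = sum(rates)
        probs = [r / z for r in rates if r > 0]
        entropy = -sum(p * math.log(p) for p in probs)
        scores[term] = entropy / math.log(n_labels)
    return scores
```

Sorting the result by score, highest first, and intersecting it with
the most frequent terms gives a list to read through by hand; the
final decision is still yours.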

