lucene-java-user mailing list archives

From Grant Ingersoll <>
Subject Re: Stop words (how to create ideal set of stop words?)
Date Fri, 11 May 2007 00:34:33 GMT
Also, from the empirical side, have a look at Luke (after indexing
without any stop words, or with just the standard ones) and see what the
most common terms are and whether they are meaningful in the context
of your application.
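
For readers without Luke at hand, the same check can be approximated
outside Lucene. The following is only a minimal sketch in Python, not
Lucene or Luke code: the tokenizer (lowercased runs of ASCII letters),
the function names, the sample documents and the 0.75 cutoff are all
assumptions made for illustration. It counts in how many documents each
term occurs and lists the terms that appear in most of them as
candidates for manual review.

```python
import re
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for text in docs:
        # Naive tokenizer (an assumption): lowercased runs of ASCII letters.
        # A set is used so a term counts at most once per document.
        terms = set(re.findall(r"[a-z]+", text.lower()))
        df.update(terms)
    return df


def stop_word_candidates(docs, min_doc_ratio=0.8):
    """Return (term, document frequency) pairs for terms found in at least
    min_doc_ratio of the documents, most frequent first."""
    df = document_frequencies(docs)
    threshold = min_doc_ratio * len(docs)
    hits = [(term, n) for term, n in df.items() if n >= threshold]
    # Sort by descending document frequency, then alphabetically.
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical sample corpus, standing in for the indexed documents.
docs = [
    "the compiler parses the source code",
    "the garbage collector frees the heap",
    "a type checker rejects the program",
    "code review improves the code",
]
candidates = stop_word_candidates(docs, min_doc_ratio=0.75)
print(candidates)
```

The list is only a starting point. In a corpus about programming
languages, a term such as "code" may rank high and still be meaningful
to users, so each candidate should be judged in the context of the
application before it goes into the stop word set.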


On May 10, 2007, at 7:41 PM, Doron Cohen wrote:

> See also  and
> karl wettin <> wrote on 10/05/2007 13:57:33:
>> On 10 May 2007, at 20:39, Lukas Vlcek wrote:
>>> Can anybody point me to some references on how to create an ideal set
>>> of stop words? I know that this is more of a theoretical question, but
>>> how do Luceners determine which words should be excluded when creating
>>> Analyzers for a new language?
>> The idea with stop words is to keep the index as small as possible
>> without major loss of features, thus they ought to be frequently
>> occurring words with little or no semantic meaning. What these words
>> are really depends on language, corpus, etc.
>>> And which technique was used for validation of stop
>>> word lists in current Analyzers?
>> My guess is that they are manually chosen from a corpus term
>> frequency vector.
>>> More specifically, I am interested in situations where there is a need
>>> to build a search engine around a specific corpus (for example, when we
>>> need to search a set of articles related to programming languages
>>> only). Given a specific corpus, is there any recommended technique for
>>> deriving stop words?
>> If you have no knowledge of the language for which you wish to produce
>> stop words, then it will be fairly hard to know what to consider a
>> stop word. You might be able to treat it as a text classification
>> problem. Feature/attribute selection for classifiers is a well
>> researched subject. Weka, Yale, R, etc. are all tools that might help
>> you. But I honestly think that no matter how you twist and turn the data,
>> manually choosing the stop words is the way to go.
>> --
>> karl
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

Grant Ingersoll
Center for Natural Language Processing