lucene-java-user mailing list archives

From Grant Ingersoll <gsing...@apache.org>
Subject Re: Stop words (how to create ideal set of stop words?)
Date Fri, 11 May 2007 00:34:33 GMT
Also, from the empirical side, have a look at Luke (after indexing w/o
any stopwords, or with just the standard ones) and see what the most
common terms are and whether they are meaningful in the context
of your application.
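
If Luke is not at hand, a rough version of the same check can be done outside Lucene. The sketch below is only an illustration under stated assumptions: the class and method names (StopWordCandidates, candidates) are made up, the regex tokenizer is a crude stand-in for whatever Analyzer you actually use, and the 0.6 cutoff is arbitrary. It counts in how many documents each term occurs and lists the terms that appear in at least the given fraction of them, most widespread first.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class StopWordCandidates {

    /**
     * Returns the terms that occur in at least minDocFraction of the documents,
     * ordered by document frequency (highest first), ties broken alphabetically.
     * Hypothetical helper; the tokenization is an assumption, not Lucene's.
     */
    public static List<String> candidates(List<String> docs, double minDocFraction) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            // Count each term at most once per document (document frequency).
            Set<String> seen = new HashSet<>();
            for (String tok : doc.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
                if (!tok.isEmpty() && seen.add(tok)) {
                    docFreq.merge(tok, 1, Integer::sum);
                }
            }
        }
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            if (e.getValue() >= minDocFraction * docs.size()) {
                result.add(e.getKey());
            }
        }
        result.sort((a, b) -> {
            int byFreq = Integer.compare(docFreq.get(b), docFreq.get(a));
            return byFreq != 0 ? byFreq : a.compareTo(b);
        });
        return result;
    }

    public static void main(String[] args) {
        // Tiny made-up corpus; a real run would read your own documents.
        List<String> docs = List.of(
            "The compiler checks the types of the program",
            "The garbage collector frees the memory",
            "A closure captures the variables of its scope");
        System.out.println(candidates(docs, 0.6)); // prints [the, of]
    }
}
```

The output is only a list of candidates. On a real corpus the top of the list will also contain frequent but meaningful domain terms, so it still has to be reviewed by hand.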

-Grant


On May 10, 2007, at 7:41 PM, Doron Cohen wrote:

> See also  en.wikipedia.org/wiki/Stop_words  and
> www.ranks.nl/tools/stopwords.html
>
> karl wettin <karl.wettin@gmail.com> wrote on 10/05/2007 13:57:33:
>
>>
>> On 10 May 2007, at 20:39, Lukas Vlcek wrote:
>>
>>> Can anybody point me to some references on how to create an ideal set
>>> of stop words? I know that this is more of a theoretical question, but
>>> how do Luceners determine which words should be excluded when creating
>>> Analyzers for a new language?
>>
>> The idea with stop words is to keep the index as small as possible
>> without major loss of features, thus they ought to be frequently
>> occurring words with little or no semantic meaning. What these words
>> are really depends on language, corpus, etc.
>>
>>> And which technique was used for validation of stop
>>> word lists in current Analyzers?
>>
>> My guess is that they are manually chosen from a corpus term
>> frequency vector.
>>
>>> More specifically, I am interested in situations where there is a need
>>> to build a search engine around a specific corpus (for example, when we
>>> need to search a set of articles related to programming languages only).
>>> Given a specific corpus, is there any recommended technique for
>>> deriving stop words?
>>
>> If you have no knowledge of the language for which you wish to produce
>> stop words, then it will be fairly hard to know what to consider a
>> stop word. You might be able to treat it as a text classification
>> problem. Feature/attribute selection for classifiers is a well
>> researched subject. Weka, Yale, R, etc. are all tools that might help
>> you. But I honestly think that no matter how you twist and turn the data,
>> manually choosing the stop words is the way to go.
>>
>>
>> --
>> karl
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ


