lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From mark harwood <markharw...@yahoo.co.uk>
Subject Re: Stop words (how to create ideal set of stop words?)
Date Fri, 11 May 2007 11:39:49 GMT
>For some reason, the application of zipf's law comes to mind, whereby  
>you could look at the most commonly occurring words and  
>mathematically deduce which ones are "too" common, but where your  
>cutoff is still may be difficult to choose.

I recently gave Andrzej a "Zipf visualisation" plug-in for Luke which may help. I can post
the code somewhere if this is useful.

Mark



----- Original Message ----
From: Grant Ingersoll <gsingers@apache.org>
To: java-user@lucene.apache.org
Sent: Friday, 11 May, 2007 12:14:12 PM
Subject: Re: Stop words (how to create ideal set of stop words?)

Use Lucener's tend to be more practically oriented!  :-)

For some reason, the application of zipf's law comes to mind, whereby  
you could look at the most commonly occurring words and  
mathematically deduce which ones are "too" common, but where your  
cutoff is still may be difficult to choose.  You will always have the  
balance between losing some information and shrinking the index, etc.

Google Scholar search for "zipf's law +stopwords" yields: http:// 
ir.dcs.gla.ac.uk/terrier/publications/rtlo_DIRpaper.pdf  which looks  
like it holds promise (but I admit I didn't read beyond the abstract)  
as it has references to old approaches for the same task, plus a  
"new" approach.

Good Luck and if you find something that works well, we would love to  
have it contributed back!

-Grant



On May 11, 2007, at 1:53 AM, Lukas Vlcek wrote:

> Hi,
>
> Thanks for your comments!
>
> I was thinking that there could be some method based on frequency and
> linguistic research. So far it seems that manually choosen set of  
> words is
> very common approach but this leaves some questions opened in my mind.
> I am not a native english speaker but I think that this (
> http://www.ranks.nl/tools/stopwords.html) makes sense, but for my  
> native
> language (http://www.ranks.nl/stopwords/czech.html) this can be  
> questionable
> in some cases (especially in case of specific corpus).
>
> What I am searching for is some authomatic method of stop words  
> extraction
> based on given set of documents. I don't expect such method to be  
> 100% exact
> but I would expect it to be ~good enough~.
>
> I will try to search in citeseer as well (was hoping somebody could  
> give me
> some references of this kind).
>
> Thanks!
> Lukas
>
> On 5/11/07, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:
>>
>> There is a handy class in contrib/misc.../ that will show you the  
>> most
>> frequent terms in an index. Handy dandy.
>>
>> Otis
>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
>> Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
>>
>> ----- Original Message ----
>> From: Lukas Vlcek <lukas.vlcek@gmail.com>
>> To: java-user@lucene.apache.org
>> Sent: Thursday, May 10, 2007 2:39:35 PM
>> Subject: Stop words (how to create ideal set of stop words?)
>>
>> Hi,
>>
>> Can anybody point me to some references how to create an ideal set  
>> of stop
>> words? I konw that this is more like a theoretical question but  
>> how do
>> Luceners determine which words shuold be excluded when creating  
>> Analyzers
>> for a new languages? And which technique was used for validation  
>> of stop
>> word lists in current Analyzers?
>>
>> More specificaly I am interested in situations when there is a  
>> need to
>> build
>> a search engine around specific corpus (for example when we need  
>> to search
>> set of articles related to programming languages only). Given a  
>> specific
>> corpus is there any recommended technique of stop words derivation?
>>
>> Thanks,
>> Lukas
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/ 
LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org






		
___________________________________________________________ 
Copy addresses and emails from any email account to Yahoo! Mail - quick, easy and free. http://uk.docs.yahoo.com/trueswitch2.html

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message