lucene-java-user mailing list archives

From "Lukas Vlcek" <lukas.vl...@gmail.com>
Subject Re: Stop words (how to create ideal set of stop words?)
Date Fri, 11 May 2007 05:53:57 GMT
Hi,

Thanks for your comments!

I was thinking that there could be some method based on frequency and
linguistic research. So far it seems that a manually chosen set of words is
a very common approach, but this leaves some questions open in my mind.
I am not a native English speaker, but I think that this list
(http://www.ranks.nl/tools/stopwords.html) makes sense. For my native
language (http://www.ranks.nl/stopwords/czech.html), however, the list can be
questionable in some cases (especially for a specific corpus).

What I am searching for is some automatic method of stop word extraction
based on a given set of documents. I don't expect such a method to be 100%
exact, but I would expect it to be ~good enough~.
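
To make the idea concrete, here is a minimal sketch of the kind of
frequency-based extraction I have in mind, written in Python rather than
against the Lucene API. It treats any term that appears in a large share of
the documents as a stop word candidate. The function name
stop_word_candidates, the min_doc_ratio parameter and the 0.8 cutoff are
assumptions chosen for illustration, not an established method.

```python
import re
from collections import Counter


def stop_word_candidates(documents, min_doc_ratio=0.8):
    """Return terms occurring in at least min_doc_ratio of the documents.

    min_doc_ratio is a hypothetical cutoff; a real corpus would need tuning.
    """
    doc_freq = Counter()
    for text in documents:
        # Count each term once per document (document frequency, not raw count).
        terms = set(re.findall(r"\w+", text.lower()))
        doc_freq.update(terms)
    threshold = min_doc_ratio * len(documents)
    return sorted(term for term, df in doc_freq.items() if df >= threshold)


# Tiny made-up corpus about programming languages.
docs = [
    "The compiler translates the source code",
    "The interpreter runs the code directly",
    "A type system checks the code",
]
print(stop_word_candidates(docs))  # prints ['code', 'the']
```

Note that in this toy corpus "code" comes out next to "the", which is exactly
the corpus-specific effect I am asking about: a word can carry little
information in one collection and still be meaningful in general.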

I will try to search CiteSeer as well (I was hoping somebody could give me
some references of this kind).

Thanks!
Lukas

On 5/11/07, Otis Gospodnetic <otis_gospodnetic@yahoo.com> wrote:
>
> There is a handy class in contrib/misc.../ that will show you the most
> frequent terms in an index. Handy dandy.
>
> Otis
> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
>
> ----- Original Message ----
> From: Lukas Vlcek <lukas.vlcek@gmail.com>
> To: java-user@lucene.apache.org
> Sent: Thursday, May 10, 2007 2:39:35 PM
> Subject: Stop words (how to create ideal set of stop words?)
>
> Hi,
>
> Can anybody point me to some references on how to create an ideal set of
> stop words? I know that this is more of a theoretical question, but how do
> Luceners determine which words should be excluded when creating Analyzers
> for new languages? And which technique was used to validate the stop
> word lists in the current Analyzers?
>
> More specifically, I am interested in situations where there is a need to
> build a search engine around a specific corpus (for example, when we need
> to search a set of articles related to programming languages only). Given a
> specific corpus, is there any recommended technique for deriving stop words?
>
> Thanks,
> Lukas
