From: Grant Ingersoll
Subject: Re: Stop words (how to create ideal set of stop words?)
Date: Thu, 10 May 2007 20:34:33 -0400
To: java-user@lucene.apache.org

Also, from the empirical side, have a look at Luke (after indexing w/o
any stopwords, or with just the standard ones) and see what the most
common terms are and whether they are meaningful in the context of
your application.

-Grant

On May 10, 2007, at 7:41 PM, Doron Cohen wrote:

> See also en.wikipedia.org/wiki/Stop_words and
> www.ranks.nl/tools/stopwords.html
>
> karl wettin wrote on 10/05/2007 13:57:33:
>
>> On 10 May 2007, at 20:39, Lukas Vlcek wrote:
>>
>>> Can anybody point me to some references on how to create an ideal
>>> set of stop words? I know that this is more of a theoretical
>>> question, but how do Luceners determine which words should be
>>> excluded when creating Analyzers for a new language?
>>
>> The idea with stop words is to keep the index as small as possible
>> without major loss of features, thus they ought to be frequently
>> occurring words with little or no semantic meaning. What these words
>> are really depends on language, corpus, etc.
>>
>>> And which technique was used for validation of stop word lists in
>>> current Analyzers?
>>
>> My guess is that they are manually chosen from a corpus term
>> frequency vector.
>>
>>> More specifically, I am interested in situations when there is a
>>> need to build a search engine around a specific corpus (for example
>>> when we need to search a set of articles related to programming
>>> languages only). Given a specific corpus, is there any recommended
>>> technique for stop word derivation?
>>
>> If you have no knowledge of the language for which you wish to
>> produce stop words, then it will be fairly hard to know what to
>> consider a stop word. You might be able to consider it as a text
>> classification problem. Feature/attribute selection for classifiers
>> is a well-researched subject.
>> Weka, Yale, R, etc. are all tools that might help you. But I
>> honestly think that no matter how you twist and turn the data,
>> manually choosing the stop words is the way to go.
>>
>> --
>> karl
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
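
A rough sketch of the frequency-based approach discussed in this thread (list the most widespread terms in a corpus, then review them by hand) might look like the following. It uses only the Python standard library rather than Lucene or Luke; the function name `stop_word_candidates`, the `min_doc_ratio` cutoff, and the three-sentence sample corpus are assumptions made up for illustration, not anything from the thread.

```python
import re
from collections import Counter


def stop_word_candidates(documents, top_n=10, min_doc_ratio=0.5):
    """Return (term, document_frequency) pairs for the most widespread terms.

    Hypothetical helper: the output is a list of *candidates* for manual
    review, not a finished stop word list.
    """
    doc_freq = Counter()
    for text in documents:
        # Count each term once per document so a single long document
        # cannot dominate the ranking.
        terms = set(re.findall(r"[a-z0-9]+", text.lower()))
        doc_freq.update(terms)

    # Keep only terms that appear in at least min_doc_ratio of the documents.
    threshold = min_doc_ratio * len(documents)
    frequent = [(term, df) for term, df in doc_freq.items() if df >= threshold]

    # Highest document frequency first; ties broken alphabetically.
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return frequent[:top_n]


# Tiny made-up corpus about programming languages.
corpus = [
    "The compiler of the language translates source code into bytecode.",
    "A garbage collector frees the memory that the program no longer uses.",
    "The type system of the language rejects invalid programs.",
]

for term, df in stop_word_candidates(corpus, top_n=5):
    print(term, df)
```

On this sample the candidates are "the", "language", and "of". Whether "language" belongs on the list is exactly the kind of call that needs a human who knows the corpus and the application, which is why the result should be treated as input to manual selection.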