From: "Alex Murzaku"
To: "'Lucene Developers List'"
Subject: RE: Proposal: Statistical Stopword elimination
Date: Mon, 31 Mar 2003 11:44:26 -0500
Organization: LISSUS llc
Message-ID: <000201c2f7a4$d2f2f430$6501000a@Lissus>
In-Reply-To: <3B48940F2D7712428BD31A041A367DDC015962@lrrr>

I
was also wondering about this, but I don't know the internals of Lucene
well enough to give you any smart implementation feedback. In my opinion,
it would be a very useful addition. I would just add that, if the frequent
term is the only term in the query, it should not be eliminated. I just
tried Google and it behaves the same way: very frequent terms ARE indexed,
and they get removed only when they are part of a query with more than one
term.

--
Alex Murzaku
___________________________________________
alex(at)lissus.com  http://www.lissus.com

-----Original Message-----
From: Karsten Konrad [mailto:Karsten.Konrad@xtramind.com]
Sent: Monday, March 31, 2003 11:30 AM
To: Lucene Developers List
Subject: Proposal: Statistical Stopword elimination

Hi,

I am experimenting with long queries (parts of documents used as the search
query), and I would like to filter out all terms with high document
frequencies when searching, i.e., a kind of statistical,
language-independent stop word elimination at search time.

For this, I have introduced a frequency limit factor into Similarity and
test for excessively high document frequencies in the TermQuery. The code
looks somewhat like this:

>>
  public Scorer scorer(IndexReader reader) throws IOException {
    TermDocs termDocs = reader.termDocs(term);

    if (termDocs == null)
      return null;

    // limit factor: fraction of all documents a term may appear in
    float limit = searcher.getSimilarity().getLimitFactor();
    int docFreq = searcher.docFreq(term);
    int max = searcher.maxDoc();

    // term is too frequent: no scorer, so the term is ignored
    if (docFreq >= (max+1)*limit)
      return null;

    return new TermScorer(this, termDocs, searcher.getSimilarity(),
                          reader.norms(term.field()));
  }
>>

A limit factor of 0.2 will then remove all terms from the search that
appear in more than (approximately) 20% of the documents. For long queries,
the search time is reduced; even on shorter text queries it drops by about
a factor of 2. A factor of 1.0 or higher will give you results identical to
the original version.
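To make the threshold arithmetic concrete, here is a minimal standalone sketch; it is not the actual patch and does not use the Lucene API. The class and method names (FrequencyLimitSketch, exceedsLimit, filterTerms) are hypothetical, and the document frequencies are made-up numbers. It also assumes the single-term exception suggested in the reply at the top of this thread: a query with only one term is never filtered.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Lucene.
public class FrequencyLimitSketch {

    // Same comparison as in the scorer code: true when the term's document
    // frequency reaches the limit and the term should be skipped.
    public static boolean exceedsLimit(int docFreq, int maxDoc, float limit) {
        return docFreq >= (maxDoc + 1) * limit;
    }

    // Returns the query terms that survive the frequency limit.
    // Assumption (from the reply, not from the original code): a query with
    // a single term is returned unchanged, however frequent that term is.
    public static List<String> filterTerms(Map<String, Integer> docFreqs,
                                           int maxDoc, float limit) {
        if (docFreqs.size() <= 1) {
            return new ArrayList<>(docFreqs.keySet());
        }
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : docFreqs.entrySet()) {
            if (!exceedsLimit(entry.getValue(), maxDoc, limit)) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up document frequencies for an index with maxDoc = 1000.
        Map<String, Integer> query = new LinkedHashMap<>();
        query.put("the", 950);
        query.put("lucene", 40);
        query.put("scorer", 12);
        System.out.println(filterTerms(query, 1000, 0.2f));
    }
}
```

With maxDoc = 1000 and a limit of 0.2, the cutoff is (1000+1)*0.2 = 200.2, so a term found in 200 documents is kept and one found in 201 is dropped. With a limit of 1.0 the cutoff is 1001, which no docFreq can reach, so nothing is filtered.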
Also, highlighting often looks better, as only the less frequent terms are
highlighted. The removed terms stay in the index and can therefore still be
searched, while more complex searches get faster with this method.

My questions:

(1) Is there some more elegant way of doing this? E.g., docFreq is
    accessed again in the TermScorer, and I would like to remove this
    redundancy.

(2) Is this a worthwhile contribution to Lucene's features in your opinion?

Comments appreciated,

--
Dr.-Ing. Karsten Konrad
Head of Information Agent Engineering
XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 3025113
Fax: +49 (681) 3025109
konrad@xtramind.com
www.xtramind.com

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org