From: "Alex Murzaku" <lists@lissus.com>
To: "'Lucene Developers List'" <lucene-dev@jakarta.apache.org>
Subject: RE: Proposal: Statistical Stopword elimination
Date: Mon, 31 Mar 2003 11:44:26 -0500
Organization: LISSUS llc

I was also wondering about this, though I don't know Lucene's internals
well enough to give you any smart implementation feedback. In my
opinion, it would be a very useful addition. I would just add that if
the frequent term is the only term in the query, it should not be
eliminated. I just tried Google and it behaves the same way: very
frequent terms ARE indexed. They get removed only when they are part of
a query with more than one term.
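
To make the rule concrete, here is a rough sketch in plain Java. It does
not use the Lucene API: the class QueryTermFilter, the filter method, and
the map of document frequencies are hypothetical names made up for
illustration. It assumes the same threshold test as the proposal quoted
below (docFreq >= (maxDoc + 1) * limit).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: drops very frequent terms
// from a query, except when the query consists of a single term.
final class QueryTermFilter {

    static List<String> filter(List<String> terms,
                               Map<String, Integer> docFreqs,
                               int maxDoc,
                               float limit) {
        // A single-term query is never filtered, however frequent the term.
        if (terms.size() <= 1) {
            return new ArrayList<>(terms);
        }
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            // Terms missing from the map are treated as having frequency 0.
            int docFreq = docFreqs.getOrDefault(term, 0);
            // Keep the term only if it stays below the frequency threshold.
            if (docFreq < (maxDoc + 1) * limit) {
                kept.add(term);
            }
        }
        return kept;
    }
}
```

With 1000 documents and a limit of 0.2, the query "the lucene" would be
reduced to "lucene" if "the" occurs in 900 documents, while the query
"the" on its own would be left alone. What should happen when every term
of a longer query is too frequent is left open here; this sketch simply
returns an empty list in that case.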

-- 
Alex Murzaku
___________________________________________
 alex(at)lissus.com  http://www.lissus.com

-----Original Message-----
From: Karsten Konrad [mailto:Karsten.Konrad@xtramind.com]
Sent: Monday, March 31, 2003 11:30 AM
To: Lucene Developers List
Subject: Proposal: Statistical Stopword elimination


Hi,

I am experimenting with long queries (parts of documents used as the
search query), and I would like to filter out all terms with high
document frequencies when searching. That is, a kind of statistical,
language-independent stop word elimination at search time.

For this, I have introduced a frequency limit factor into Similarity and
test for excessively high document frequencies in the TermQuery. The
code looks somewhat like this:

>>
    public Scorer scorer(IndexReader reader) throws IOException {
      TermDocs termDocs = reader.termDocs(term);

      if (termDocs == null)
        return null;

      // Skip terms whose document frequency reaches the limit.
      float limit = searcher.getSimilarity().getLimitFactor();
      int docFreq = searcher.docFreq(term);
      int max = searcher.maxDoc();
      if (docFreq >= (max + 1) * limit)
        return null;

      return new TermScorer(this, termDocs, searcher.getSimilarity(),
                            reader.norms(term.field()));
    }
>>

A limit factor of 0.2 will then remove all terms from the search that
appear in more than (approximately) 20% of the documents. For long
queries, the search time is reduced, by about a factor of 2 even on
shorter text queries. A factor of 1.0 or higher will give you results
identical to the original version. Also, highlighting often looks better
as only less frequent terms are highlighted.
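
As a quick check of the arithmetic, the threshold test can be tried in
isolation. The sketch below is plain Java with no Lucene dependency; the
class DocFreqLimit and the method isTooFrequent are hypothetical names
used only for illustration, and the numbers are made up.

```java
// Hypothetical stand-alone version of the threshold test from the
// scorer() snippet: docFreq >= (maxDoc + 1) * limit.
final class DocFreqLimit {

    static boolean isTooFrequent(int docFreq, int maxDoc, float limit) {
        return docFreq >= (maxDoc + 1) * limit;
    }

    public static void main(String[] args) {
        int maxDoc = 1000;

        // With limit 0.2 the cut-off is 1001 * 0.2 = 200.2 documents.
        System.out.println(isTooFrequent(200, maxDoc, 0.2f));  // prints false
        System.out.println(isTooFrequent(201, maxDoc, 0.2f));  // prints true

        // With limit 1.0 the cut-off is 1001, which no term can reach,
        // so nothing is dropped and results match the unmodified search.
        System.out.println(isTooFrequent(1000, maxDoc, 1.0f)); // prints false
    }
}
```

The "+ 1" is what makes a factor of 1.0 safe: even a term that occurs in
every document stays below the cut-off.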

While the terms removed stay in the index and therefore can still be
searched, we can speed up more complex searches by this method.

My questions:

(1) Is there some more elegant way of doing this? E.g., access to the
docFreq is done again in the TermScorer and I would like to remove this
redundancy.

(2) Is this a worthwhile contribution to Lucene's features in your
opinion?

Comments appreciated,

--

Dr.-Ing. Karsten Konrad
Head of Information Agent Engineering

XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 3025113
Fax: +49 (681) 3025109
konrad@xtramind.com
www.xtramind.com


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

