From: "Karsten Konrad"
To: "Lucene Users List"
Subject: Re: Document Similarity
Date: Wed, 3 Dec 2003 23:24:09 +0100

Hi,

>> Do they produce same ranking
results?

No; Lucene's operations on query weight and length normalization are
not equivalent to a vanilla cosine in vector space.

>> I guess the 2nd approach will be more precise but slow.

Query similarity will indeed be faster, but it may not actually be worse.
A straightforward cosine without the IDF weighting of terms that Lucene
applies will almost certainly be less precise if you have documents of
different lengths: word occurrence probabilities in texts of different
lengths vary greatly, and the cosine between unrelated longer texts will
often be greater than that between texts that actually share a topic but
are short, just because of non-content words that happen to match.

If, on the other hand, you choose the right TF/IDF weighting of terms,
the cosine in this warped vector space could (a) be equivalent to the one
Lucene computes, which requires some work, or (b) even be better on
average.

However, the last time I counted, there were about 250 different TF/IDF
formulas around in IR publications, machine learning, computational
linguistics and so on. Performance depends on domain and language.

But if I were you, I would just start playing and have fun with the
stuff...

Karsten

-----Original Message-----
From: Jing Su [mailto:J.Su@cs.bham.ac.uk]
Sent: Tuesday, 2 December 2003 18:12
To: lucene-user@jakarta.apache.org
Subject: Document Similarity

Hi,

I have read some posts in the user/developer archives about Lucene-based
document similarity comparison. In summary, two approaches are mentioned:

1 - Construct a query from the document;
2 - Represent each document as a vector, then rank according to their
    distance (cosine).

Do they produce same ranking results? Is there any other way to do so? I
guess the 2nd approach will be more precise but slow.

Thanks.
Jing

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
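
The point in the reply about document length can be tried out with a
small sketch. It compares a plain term-count cosine with a cosine over
TF/IDF-weighted vectors on a toy corpus. Everything in it is an
assumption made for illustration: the four documents are invented, the
function names are hypothetical, and idf(t) = ln(N / df(t)) is only one
of the many TF/IDF variants mentioned above. It is not Lucene's scoring
formula and does not use the Lucene API.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def tf_vector(tokens):
    """Raw term counts, no weighting."""
    return dict(Counter(tokens))


def idf_table(corpus):
    """idf(t) = ln(N / df(t)); an assumed variant, one of many in use."""
    n = len(corpus)
    df = Counter(t for tokens in corpus for t in set(tokens))
    return {t: math.log(n / count) for t, count in df.items()}


def tfidf_vector(tokens, idf):
    """Term counts scaled by idf; terms unseen in the corpus get weight 0."""
    return {t: c * idf.get(t, 0.0) for t, c in Counter(tokens).items()}


# Invented toy corpus: two long texts on different topics, one short text
# on the first topic, and a third long text that only adds stop words.
long_search = "the index of the search engine and the query of the user in the index".split()
short_search = "search engine index query".split()
long_cooking = "the recipe of the soup and the bread of the day in the kitchen".split()
long_travel = "the price of the ticket and the time of the train in the city".split()
corpus = [long_search, short_search, long_cooking, long_travel]

# Plain cosine on term counts.
raw_related = cosine(tf_vector(long_search), tf_vector(short_search))
raw_unrelated = cosine(tf_vector(long_search), tf_vector(long_cooking))

# Cosine on TF/IDF-weighted vectors.
idf = idf_table(corpus)
weighted_related = cosine(tfidf_vector(long_search, idf), tfidf_vector(short_search, idf))
weighted_unrelated = cosine(tfidf_vector(long_search, idf), tfidf_vector(long_cooking, idf))

print(f"raw cosine:    related={raw_related:.2f} unrelated={raw_unrelated:.2f}")
print(f"tf-idf cosine: related={weighted_related:.2f} unrelated={weighted_unrelated:.2f}")
```

On this toy data the plain cosine scores the two unrelated long texts
higher than the long/short pair on the same topic, because "the", "of",
"and" and "in" dominate the counts. With the assumed IDF weighting the
order is reversed. A different corpus or a different TF/IDF formula may
behave differently, so the numbers should not be read as a general
result.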