From: "Spencer, Dave"
To: "Lucene Users List"
Subject: RE: search similar docs?
Date: Wed, 3 Apr 2002 09:15:26 -0800
Message-ID: <728DA21B8941A843A7C496F1ACF48518012CFB5E@gleam.lumos.com>

Can't you feed the text of the orig/matching doc to the search engine as a query and see what docs it returns? Then a "similar" doc is just one that has words in common w/ the orig doc.

I've done this kind of monster query with some of our internal systems - we have support mail that comes in, and all kinds of intranet sites, bug databases, and javadoc indexed. When reading the support mail in a web interface you can feed the *entire* body of the mail to the search engine to find "similar" support mails, bug reports, howto docs etc.
Not fast and just a proof of concept right now but kinda interesting. [note: had to switch the
action from GET to POST due to the size of the "query"]

-----Original Message-----
From: Daniel Calvo [mailto:dcalvo@ig.com.br]
Sent: Tuesday, February 12, 2002 12:25 PM
To: Lucene Users List
Subject: search similar docs?

Hi,

I was thinking of implementing a search for similar documents (like some commercial search engines do) and wondering if anyone has already done something like that with Lucene.

I thought of collecting all terms of the selected document (or maybe some subset of them) and then creating a MultiTermQuery containing those terms. Does it make sense? Is there a better way to achieve this?

In order to do it, I would have to get all terms of a given document and so far I haven't found an easy way of doing it (I hope there's one ;-).

The way I was thinking is to extend FilteredTermEnum but, instead of selecting terms by similarity, select them by docid (for each term, get its termdocs and check for the desired docid). It doesn't look very efficient so if someone could contribute with other ideas or even related experiences I'd appreciate very much.

TIA

Best regards,

--Daniel
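As a rough illustration of the two ideas in this thread, here is a minimal sketch in Python. It uses a toy in-memory inverted index, not Lucene's API. Every name in it (`tokenize`, `build_index`, `terms_of`, `similar_docs`, and the sample documents) is made up for the example. Ranking is a plain count of shared terms, which is an assumption for simplicity; a real engine would weight terms rather than just count them.

```python
from collections import Counter, defaultdict
import re


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a crude stand-in for an analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of doc ids containing it (a toy inverted index)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def terms_of(index, doc_id):
    """Walk every term and keep those whose postings include doc_id."""
    return [term for term, postings in index.items() if doc_id in postings]


def similar_docs(index, query_terms, exclude=None):
    """Treat the terms as one big OR query; rank docs by number of shared terms."""
    hits = Counter()
    for term in set(query_terms):
        for doc_id in index.get(term, ()):
            if doc_id != exclude:
                hits[doc_id] += 1
    return hits.most_common()


# Hypothetical sample documents, invented for this sketch.
docs = {
    "mail-1": "IndexWriter throws IOException when the index directory is locked",
    "bug-7": "Lock file left behind so IndexWriter throws IOException on open",
    "howto-3": "How to format dates for range queries",
}
index = build_index(docs)

# Variant 1: recover the document's terms from the index by doc id, then query.
by_id = similar_docs(index, terms_of(index, "mail-1"), exclude="mail-1")

# Variant 2: feed the raw text of the document to the search as the query.
by_text = similar_docs(index, tokenize(docs["mail-1"]), exclude="mail-1")
```

Both variants end up issuing the same set of terms here, so they rank the same documents. Note that `terms_of` scans the whole vocabulary for a single document, which is the efficiency concern raised in the original message; querying with the stored text avoids that scan but requires the text to be available at query time.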