Date: Fri, 14 Nov 2003 10:14:29 -0800
Subject: inter-term correlation [was Re: Vector Space Model in Lucene?]
From: "Joshua O'Madadhain"
To: "Lucene Users List"

Incorporating inter-term correlation into Lucene isn't that hard; I've
done it. Nor is it incompatible with the vector-space model.

I'm not happy with the specific correlation metric that I picked, which
is why I'm not eager to generally release the code I wrote, but I think
that the basic mechanism that I came up with (query expansion via
correlated terms, where the added terms were boosted according to the
strength of the correlation) is fairly sound. And I didn't need any
changes to Lucene to do this.

You can get some details on the specific mechanism that I used here, if
you're interested:

http://www.ics.uci.edu/~jmadden/research/index.html

(and go down to "Fuzzy Term Expansion and Document Reweighting", about
halfway down.)

If you decide that my ideas are interesting enough that you want to
have a look at my code, let me know, and perhaps we can work something
out.

Regards,

Joshua O'Madadhain

On Friday, Nov 14, 2003, at 09:52 US/Pacific, Chong, Herb wrote:

> i don't know of any open source search engine that incorporates
> interterm correlation. i have been looking into how to do this in
> Lucene and so far, it's not been promising. the indexing engine and
> file format needs to be changed.
> there are very few search engines
> that incorporate interterm correlation in any mathematically and
> linguistically rigorous manner. i designed a couple, but they were all
> research experiments.
>
> if you are familiar with the TREC automatic adhoc track? my
> experiments with the TREC-5 to TREC-7 questions produced about 0.05 to
> 0.10 improvement in average precision by proper use of interterm
> correlation. my project at the time was cancelled after TREC-7 and so
> there haven't been any new developments.
>

jmadden@ics.uci.edu...Obscurium Per Obscurius...www.ics.uci.edu/~jmadden
Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
It's that moment of dawning comprehension that I live for--Bill Watterson
My opinions are too rational and insightful to be those of any organization.
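
For readers following the thread, the mechanism described in the reply (query expansion via correlated terms, with each added term boosted by the strength of its correlation) might be sketched roughly as below. This is not the author's code and does not use his correlation metric. The class name, the correlation table, its values, and the minStrength cutoff are all hypothetical, chosen only for illustration. The sketch makes no Lucene API calls; it only builds a query string that relies on the query parser's term^boost syntax, which is consistent with the remark that no changes to Lucene were needed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class QueryExpansionSketch {

    /**
     * Expands each query term with its correlated terms. Every added term is
     * boosted by its correlation strength using the term^boost query syntax.
     * The correlation table and the minStrength cutoff are hypothetical inputs;
     * how the correlations are computed is outside the scope of this sketch.
     */
    public static String expand(List<String> queryTerms,
                                Map<String, Map<String, Double>> correlations,
                                double minStrength) {
        List<String> clauses = new ArrayList<>();
        for (String term : queryTerms) {
            // The original term is kept as-is (default boost).
            clauses.add(term);
            Map<String, Double> related =
                correlations.getOrDefault(term, Collections.<String, Double>emptyMap());
            for (Map.Entry<String, Double> e : related.entrySet()) {
                // Skip weakly correlated terms so they do not dilute the query.
                if (e.getValue() >= minStrength) {
                    clauses.add(String.format(Locale.ROOT, "%s^%.2f",
                                              e.getKey(), e.getValue()));
                }
            }
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        // Made-up correlation strengths, for illustration only.
        Map<String, Double> jaguar = new LinkedHashMap<>();
        jaguar.put("car", 0.80);
        jaguar.put("cat", 0.45);
        jaguar.put("coffee", 0.05);

        Map<String, Map<String, Double>> table = new LinkedHashMap<>();
        table.put("jaguar", jaguar);

        // prints: jaguar car^0.80 cat^0.45 speed
        System.out.println(expand(Arrays.asList("jaguar", "speed"), table, 0.10));
    }
}
```

The resulting string could then be handed to Lucene's query parser like any other query, so the expansion lives entirely on the query side and leaves the index and file format untouched.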