Date: Sat, 7 Jul 2007 08:14:02 -0500
From: "Ryan Ackley"
To: java-user@lucene.apache.org
Subject: Re: Related Article question

I was playing around with MoreLikeThis and noticed the problems you are
talking about as well.

One idea I had was for MoreLikeThis to focus only on proper nouns for the
purposes of similarity, or to give those a significant boost. That is pretty
much the same idea you had in #1.

I found a list of the 1000 most used English words on the net (including
stemmed variations, see link below). This would be one way to look for proper
nouns: if a term for the MoreLikeThis query is in that list, don't use it.

This is a good resource:
http://www1.harenet.ne.jp/~waring/vocab/wordlists/vocfreq.html

On 7/6/07, sdeck wrote:
>
> Hello all,
> I have been trying out MoreLikeThis and many other similarity types of
> queries, but still run into problems with content not being matched up.
>
> Let me give an example, as well as some questions that hopefully someone
> can answer, to help me refine my work.
>
> Example:
> 1) Document A may have a title: Oden and Durant Are being recruited, and
> Document B would have a title
> Trailblazers look at Oden and Durant.
> Both Document A and B talk about the recruitment of Oden and Durant, just
> in fairly different ways. One may emphasize Oden over Durant, or vice versa.
> The way the MoreLikeThis and similarity queries seem to work is that they
> take terms and see if a lot of them match up in the documents. So, if Durant
> is in doc A 10 times and 10 times in doc B, then the similarity will be
> higher.
>
> Here is my problem though. I run these morelikethis and other similarity
> queries and many of those types of articles do not get matched, because a
> lot of the terms are not the same, even though they are talking about the
> same topic.
>
> Here is what I wonder:
> 1) Should I somehow give more boost to a full name, or other names, or
> titles to help matching? Or does that hinder things?
> 2) How does shorter content versus longer content work? I may only get
> around 5-6 sentences in one document, but a full page in another, and they
> are still talking about the same thing.
> 3) How would term vectors help, versus not storing term vectors?
>
> To also help, the way the system is set up, I have one main index. I will
> run a search of the web and collect more documents. Before adding these to
> the main index, I will run a morelikethis query against the main index for
> each of the new documents to be inserted. That way, I can keep a separate
> place of what articles are related to each other for faster lookups. I also
> do a morelikethis query against the new index, just to see what recently
> searched articles are similar to each other.
> It would seem that document frequency and term numbers will not really work
> in these sorts of scenarios.
>
> Not sure if I am explaining my problem as well as I can, but I would love
> some kind of reference for figuring out how to do related article searching
> and see how I can refine my results.
> Right now, I would say about 60-70% get
> correctly mapped into related articles, and about 10-20% get
> incorrectly mapped as a related article (similar terms, but perhaps not
> enough content, and the article is not about any of the others).
>
> Any help would be appreciated.
> Thanks
> Scott
> --
> View this message in context: http://www.nabble.com/Related-Article-question-tf4038641.html#a11474031
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
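
A rough sketch of the common-word filter idea from the reply, in plain Java
with no Lucene dependency. Everything here is an assumption made for
illustration: the class name CommonWordFilter, the method distinctiveTerms,
the simple tokenizer, and the handful of words standing in for the 1000-word
frequency list (a real list would be loaded from a file). It only shows the
filtering step, not how MoreLikeThis builds or scores its query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Lucene.
final class CommonWordFilter {

    // Tiny stand-in for the 1000 most used English words (assumed contents).
    static final Set<String> COMMON_WORDS = new HashSet<>(Arrays.asList(
            "a", "and", "are", "at", "being", "look", "of", "the"));

    // Lowercases the text, splits on non-alphanumeric characters, and keeps
    // each term that is not in commonWords, in first-seen order, once each.
    static List<String> distinctiveTerms(String text, Set<String> commonWords) {
        List<String> kept = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty() || commonWords.contains(token)) {
                continue; // skip split artifacts and common words
            }
            if (seen.add(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // The two example titles from the quoted message.
        System.out.println(distinctiveTerms(
                "Oden and Durant Are being recruited", COMMON_WORDS));
        System.out.println(distinctiveTerms(
                "Trailblazers look at Oden and Durant", COMMON_WORDS));
    }
}
```

With a filter like this, the terms that survive from both example titles are
mostly the names, which is the point of the proper-noun idea. Note that a
word list is only a proxy: uncommon ordinary words such as "recruited" pass
through as well. MoreLikeThis also has a setStopWords method that may be the
natural place to plug such a list in; check the Javadoc for your Lucene
version.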