Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Message-ID: <e15c52dc0707070614j7bcc9fc6vbb1c4e9218919d11@mail.gmail.com>
Date: Sat, 7 Jul 2007 08:14:02 -0500
From: "Ryan Ackley" <ryanackley@gmail.com>
To: java-user@lucene.apache.org
Subject: Re: Related Article question
In-Reply-To: <11474031.post@talk.nabble.com>
References: <11474031.post@talk.nabble.com>

I was playing around with MoreLikeThis and I noticed the problems you
are talking about as well.

One idea I had was for MoreLikeThis to use only proper nouns when
computing similarity, or to give them a significant boost. That is
pretty much the same idea you had in #1.

I found a list of the 1000 most used English words on the net
(including stemmed variations; see the link below). This would be one
way to look for proper nouns: if a candidate term for the MoreLikeThis
query is in that list, don't use it.

This is a good resource:

http://www1.harenet.ne.jp/~waring/vocab/wordlists/vocfreq.html
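
To make the word-list idea concrete, here is a minimal sketch in plain
Java with no Lucene calls. The class name CommonWordFilter, the method
keepUncommonTerms, and the five-word stand-in list are all made up for
illustration; the real list would be loaded from the page above. It is
only a rough proxy for "proper noun": any word that is not in the
common list gets through, whether or not it is a name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene: drops candidate terms that
// appear in a list of very common English words.
class CommonWordFilter {

    // Returns the terms that are NOT in commonWords, preserving order.
    // Assumes commonWords holds lowercase entries.
    static List<String> keepUncommonTerms(List<String> terms, Set<String> commonWords) {
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            if (!commonWords.contains(term.toLowerCase(Locale.ROOT))) {
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Tiny stand-in for the 1000-word list (assumption for the demo).
        Set<String> commonWords = Set.of("look", "at", "and", "are", "being");
        List<String> title = List.of("Trailblazers", "look", "at", "Oden", "and", "Durant");
        // prints [Trailblazers, Oden, Durant]
        System.out.println(keepUncommonTerms(title, commonWords));
    }
}
```

If I remember right, MoreLikeThis also has a setStopWords(Set) setter,
so the same set could probably be handed to it directly instead of
filtering by hand. Check the javadoc for your Lucene version before
relying on that.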

On 7/6/07, sdeck <scott.decker@gmail.com> wrote:
>
> Hello all,
>  I have been trying out the MoreLikeThis and many other similarity types of
> queries, but still run into problems with content not being matched up.
>
> Let me give an example, as well as some questions that hopefully someone can
> answer, to help me refine my work.
>
> Example:
> 1) Document A may have the title "Oden and Durant Are being recruited", and
> Document B the title "Trailblazers look at Oden and Durant".
>  Both Document A and B talk about the recruitment of Oden and Durant, just
> in fairly different ways.  One may emphasize Oden over Durant, or vice versa.
>  The way MoreLikeThis and the similarity queries seem to work is that they
> take terms and see whether a lot of them match up in the documents. So, if Durant
> is in doc A 10 times and in doc B 10 times, then the similarity will be
> higher.
>
> Here is my problem though. I run these MoreLikeThis and other similarity
> queries, and many of those types of articles do not get matched because a
> lot of the terms are not the same, even though they are talking about the
> same topic.
>
> Here is what I wonder:
> 1) Should I somehow give more boost to a full name, or other names, or
> titles to help matching? Or does that hinder things?
> 2) How does shorter content versus longer content work? I may only get
> around 5-6 sentences in one document, but a full page in another, yet they
> are still talking about the same thing.
> 3) How would term vectors help, versus not storing term vectors?
>
> To also help, the way the system is set up, I have one main index.  I will
> run a search of the web and collect more documents. Before adding these to
> the main index, I will run a MoreLikeThis query against the main index for
> each of the new documents to be inserted.  That way, I can keep a separate
> record of which articles are related to each other for faster lookups.  I also
> run a MoreLikeThis query against the new index, just to see which recently
> searched articles are similar to each other.
> It would seem that document frequency and term numbers will not really work
> in these sorts of scenarios.
>
> Not sure if I am explaining my problem as well as I can, but I would love
> some kind of reference for figuring out how to do related article searching
> and how I can refine my results. Right now, I would say about 60-70% get
> correctly mapped into related articles, and about 10-20% get
> incorrectly mapped as a related article (similar terms, but perhaps not
> enough content, and the article is not about any of the others).
>
> Any help would be appreciated.
> Thanks
> Scott
> --
> View this message in context: http://www.nabble.com/Related-Article-question-tf4038641.html#a11474031
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
