lucene-java-user mailing list archives

From David Spencer <dave-lucene-u...@tropo.com>
Subject Re: similarity of two texts - another question
Date Wed, 02 Jun 2004 18:37:13 GMT
Gerard Sychay wrote:

> Hmm, the term vector does not have to consist of only term frequencies,
> does it? To give weight to rare terms, could you create a term vector of
> (TF*IDF) values for each term?  Then, a distance function would measure
> how many terms two vectors have in common, giving weight to how many
> rare terms two vectors have in common.

Yeah, but if you're gonna do that why not just form a query with all 
words in the source document, and let the Lucene engine do the idf/tf 
calculations? I've done this and it seems to work fine.
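For comparison, here is roughly what doing the TF*IDF vector comparison by hand would involve. This is a minimal plain-Java sketch, not Lucene code and not Lucene's scoring formula: the class name TfIdfSimilarity, the naive tokenizer, the idf variant ln(1 + N/df), and the choice of cosine as the distance function are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; not part of Lucene.
class TfIdfSimilarity {

    // Naive tokenizer (assumption): lowercase, split on anything but a-z and 0-9.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String tok : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!tok.isEmpty()) {
                tokens.add(tok);
            }
        }
        return tokens;
    }

    // Document frequency: the number of documents that contain each term.
    static Map<String, Integer> documentFrequencies(List<String> corpus) {
        Map<String, Integer> df = new HashMap<>();
        for (String doc : corpus) {
            for (String term : new HashSet<>(tokenize(doc))) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // TF*IDF vector for one document, using idf = ln(1 + N/df) (one common variant).
    static Map<String, Double> tfIdfVector(String doc, Map<String, Integer> df, int numDocs) {
        Map<String, Double> vec = new HashMap<>();
        for (String term : tokenize(doc)) {
            vec.merge(term, 1.0, Double::sum); // raw term frequency
        }
        for (Map.Entry<String, Double> e : vec.entrySet()) {
            int d = df.getOrDefault(e.getKey(), 1);
            e.setValue(e.getValue() * Math.log(1.0 + (double) numDocs / d));
        }
        return vec;
    }

    // Cosine similarity of two sparse vectors; 0.0 if either vector is empty.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        List<String> corpus = new ArrayList<>();
        corpus.add("the zebra runs");
        corpus.add("the zebra sleeps");
        corpus.add("the dog barks");
        corpus.add("the cat purrs");
        Map<String, Integer> df = documentFrequencies(corpus);
        Map<String, Double> v0 = tfIdfVector(corpus.get(0), df, corpus.size());
        Map<String, Double> v1 = tfIdfVector(corpus.get(1), df, corpus.size());
        Map<String, Double> v2 = tfIdfVector(corpus.get(2), df, corpus.size());
        // Documents 0 and 1 share the rarer term "zebra"; 0 and 2 share only "the".
        System.out.println("sim(0,1) = " + cosine(v0, v1));
        System.out.println("sim(0,2) = " + cosine(v0, v2));
    }
}
```

In this toy corpus the pair sharing "zebra" scores higher than the pair sharing only "the", which is the rare-term weighting effect under discussion. Handing the terms to Lucene as a query gets a similar effect without maintaining these statistics yourself.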

Here's the code I've used. It could be improved by building the query
directly instead of going through QueryParser, and on a long document
it will likely throw BooleanQuery.TooManyClauses unless you raise
Lucene's maximum clause count from its default (1024), but this is the
idea. "srch" is the entire body of the source document.


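    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public static Query formSimilarQuery(String srch, Analyzer a)
        throws org.apache.lucene.queryParser.ParseException, IOException
    {
        StringBuffer sb = new StringBuffer();
        // "foo" is only a placeholder field name for the analyzer.
        TokenStream ts = a.tokenStream("foo", new StringReader(srch));
        org.apache.lucene.analysis.Token t;
        // Collect every analyzed token into one space-separated query string.
        while ((t = ts.next()) != null)
        {
            sb.append(t.termText()).append(' ');
        }
        ts.close();
        // DFields.CONTENTS is an application constant naming the field to search.
        return QueryParser.parse(sb.toString(), DFields.CONTENTS, a);
    }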
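One possible way to stay under the clause limit without reconfiguring Lucene is to collapse repeated terms and cap the number of distinct terms before building the query string. The sketch below is plain Java and is not something the code above does: the class name QueryTermLimiter, the method topTerms, and the policy of keeping the most frequent terms are assumptions for illustration. Collapsing repeats also changes how much each term contributes to the score, so treat it as a trade-off rather than a drop-in fix.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; not part of Lucene.
class QueryTermLimiter {

    // Returns at most maxTerms distinct terms, most frequent first.
    // Terms with equal counts keep the order in which they first appeared.
    static List<String> topTerms(List<String> tokens, int maxTerms) {
        // LinkedHashMap preserves first-seen order of the distinct terms.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // List.sort is stable, so ties stay in first-seen order.
        entries.sort((x, y) -> Integer.compare(y.getValue(), x.getValue()));
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries) {
            if (result.size() >= maxTerms) {
                break;
            }
            result.add(e.getKey());
        }
        return result;
    }
}
```

The tokens would come from the analyzer loop, and the returned terms would be joined with spaces in place of the full token list, with maxTerms set at or below the configured clause limit.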


> >>> David Spencer <dave-lucene-user@tropo.com> 06/01/04 08:25PM >>>
>
> Erik Hatcher wrote:
>
>> On Jun 1, 2004, at 4:41 PM, uddam chukmol wrote:
>>
>>> Well, a question again, how does Lucene compute the score between a
>>> document and a query?
>
> And I might add, thus, this approach to similarity gives more weight to
> rare terms that match, which one might want for this kind of similarity
> measure.


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

