lucene-java-user mailing list archives

From David Spencer
Subject Re: TFIDF Implementation
Date Tue, 14 Dec 2004 18:50:26 GMT
Bruce Ritchie wrote:

> Christoph,
> I'm not entirely certain if this is what you want, but a while back David Spencer did
> code up a 'More Like This' class which can be used for generating similarities between documents.
> I can't seem to find this class in the sandbox

Uh oh, sorry, I'll try to get this checked in soonish. For me it's 
always one thing to do a prelim version of a piece of code, but another 
matter to get it correctly packaged.

> so I've attached it here. Just repackage and test.

An alternate approach to find "similar" docs is to use all (possibly 
unique) tokens in the source doc to form a large query. This is the code I use:

'srch' is the entire untokenized text of the source doc
'a' is the analyzer you want to use
'field' is the field you want to search on, e.g. "contents" or "body"
'stop' is an optional set of stop words to ignore (may be null)

It returns a query, which you then use to search for "similar" docs. In 
the returned results you need to make sure you ignore the source doc, 
which will probably come back first. You can use stemming, synonyms, or 
fuzzy expansion for each term too.

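import java.io.*;
import java.util.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

public static Query formSimilarQuery(
	String srch, Analyzer a, String field, Set stop)
	throws org.apache.lucene.queryParser.ParseException, IOException
{
	// "foo" is a placeholder field name; it matters only to per-field analyzers
	TokenStream ts = a.tokenStream( "foo", new StringReader( srch));
	org.apache.lucene.analysis.Token t;
	BooleanQuery tmp = new BooleanQuery();
	Set already = new HashSet(); // words already added to the query
	while ( (t = ts.next()) != null)
	{
		String word = t.termText();
		// skip stop words
		if ( stop != null &&
			 stop.contains( word)) continue;
		// skip words seen before: add() returns false for duplicates
		if ( ! already.add( word)) continue;
		TermQuery tq = new TermQuery( new Term( field, word));
		// optional clause: neither required nor prohibited
		tmp.add( tq, false, false);
	}
	return tmp;
}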
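
If you want to see which terms would end up in the query without having an index or analyzer at hand, the filtering loop can be tried on its own. This is only a rough sketch in plain Java, not Lucene code: the class SimilarTerms and the method uniqueTerms are made-up names, and the lowercase/split-on-non-alphanumerics step is an assumed stand-in for whatever your real Analyzer does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene: mirrors the stop-word and
// duplicate filtering of the loop above, with a crude tokenizer standing
// in for the Analyzer (assumption).
final class SimilarTerms {

    // Returns the distinct non-stop words of 'text' in first-seen order.
    static List<String> uniqueTerms(String text, Set<String> stop) {
        List<String> terms = new ArrayList<>();
        Set<String> already = new HashSet<>();
        String[] words = text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        for (String word : words) {
            if (word.isEmpty()) continue;                       // split may yield an empty first element
            if (stop != null && stop.contains(word)) continue;  // skip stop words
            if (!already.add(word)) continue;                   // add() returns false for duplicates
            terms.add(word);
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "a", "of"));
        // prints [cat, chased, tail]
        System.out.println(uniqueTerms("The cat chased the tail of a cat", stop));
    }
}
```

Each word in the returned list corresponds to one optional TermQuery clause in the query built above, so the list length tells you roughly how big the query will get for a given source doc.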


> Regards,
> Bruce Ritchie
>>-----Original Message-----
>>From: Christoph Kiefer
>>Sent: December 14, 2004 11:45 AM
>>To: Lucene Users List
>>Subject: TFIDF Implementation
>>My current task/problem is the following: I need to implement 
>>TFIDF document term ranking using Jakarta Lucene to compute a 
>>similarity rank between arbitrary documents in the constructed index.
>>I saw from the API that there are similar functions already 
>>implemented in the class Similarity and DefaultSimilarity but 
>>I don't know exactly how to use them. At the time my index 
>>has about 25000 (small) documents and there are about 75000 
>>terms stored in total.
>>Now, my question is simple. Has anybody done this before, 
>>or could anybody point me to another location for help?
>>Thanks for any help in advance.
>>Christoph Kiefer
>>Department of Informatics, University of Zurich
>>Office: Uni Irchel 27-K-32
>>Phone:  +41 (0) 44 / 635 67 26
>>To unsubscribe, e-mail:
>>For additional commands, e-mail: