lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Spencer <dave-lucene-u...@tropo.com>
Subject Re: full text as input ?
Date Fri, 14 Jan 2005 06:16:20 GMT
Hunter Peress wrote:

> is it efficient and feasible to use lucene to do full text
> comparisions. eg :  take an entire text thats reasonably large ( eg
> more than 10 words) and find the result set within the lucene search
> index that  is statistically similar with all the text.

I do this kind of stuff all the time, no problem.
I think this came up a month ago - probably appears monthly.
For another variation search for "MoreLikeThis" in the list - it's code 
I mailed in that I haven't, yet, checked in.

Anyway, if you want to search for docs that are similar to a source 
document, you can all this method to generate a similarity query.

'srch' is the source doc
'a' is your analyzer
'field' is the field that stores the body e.g. "contents"
'stop' is an opt Set of stop words to ignore as an optimization - it's 
not needed if the Analyzer ignores stop words, but if you keep stop 
words you might still want to ignore them in this kind of query as they 
probably won't help


   public static Query formSimilarQuery( String srch,								 
Analyzer a,									String field,									Set stop)
		throws org.apache.lucene.queryParser.ParseException, IOException
	{	
		TokenStream ts = a.tokenStream( field, new StringReader( srch));
		org.apache.lucene.analysis.Token t;
		BooleanQuery tmp = new BooleanQuery();
		Set already = new HashSet();
		while ( (t = ts.next()) != null)
		{
			String word = t.termText();
			if ( stop != null &&
				 stop.contains( word)) continue;
			if ( ! already.add( word)) continue;
			TermQuery tq = new TermQuery( new Term( field, word));
			tmp.add( tq, false, false);
		}

		// tbd, from lucene in action book
		// https://secure.manning.com/catalog/view.php?book=hatcher2&item=source
		// exclude myself
		//likeThisQuery.add(new TermQuery(
         //new Term("isbn", doc.get("isbn"))), false, true);
		return tmp;

	}

> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message