lucene-java-user mailing list archives

From Jason <jeac...@hardlight.com.au>
Subject hithighlighter bug
Date Wed, 10 Jan 2007 02:34:54 GMT
Hi all,
	I have come across what I think is a curious but insidious bug in the 
Java Lucene hit highlighter. I first found this problem in Lucene v1.4, 
so I updated to the latest version of Lucene and the highlighter; 
unfortunately it's still there in v2.0.0.

I am indexing XML documents and am also using the hit highlighter for 
search results. This works perfectly in almost every case except for one.

In my code I have this:

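public class LuceneSearch implements org.apache.lucene.search.highlight.Formatter
{
...
	// Wraps each matching term in <em> tags; non-matching text is returned untouched.
	public String highlightTerm(String originalText, TokenGroup group)
	{
		// A score of zero or less means this token group did not match the query.
		if (group.getTotalScore() <= 0)
		{
			return originalText;
		}
		return "<em>" + originalText + "</em>";
	}
}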

When I search for -> Acquisition Plan <-
my search results contain:
<summary>(ancillary stuff deleted)....
attached to the <em>Acquisition</em>
< em>Plan</em>and signed</summary>

Notice the space between the < and the e in the second < em>.
This only occurs for these search terms and for this document (as far as 
I know), but because it's part of a much larger XML document it breaks the 
whole thing.
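
To make the failure concrete: an XML parser rejects a `<` that is followed by whitespace, so a single `< em>` is enough to make the enclosing document unparseable. The following is a minimal sketch using only the JDK's built-in parser; the class and method names are hypothetical and are not part of Lucene or of the code above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical helper (not part of Lucene): reports whether a string parses as XML.
class WellFormedCheck {
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            // Malformed markup surfaces as a SAXException; any other
            // failure is also treated as "not well-formed" in this sketch.
            return false;
        }
    }
}
```

Passing the summary with a correct `<em>` tag should return true, while the same summary containing `< em>` should return false. The JDK parser may also print a fatal-error line to standard error for the rejected input.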

The original XML is unremarkable, with no strange characters surrounding 
these terms. Here is a snippet from the relevant paragraph, from which 
these highlighted terms come:

-> attached to the Acquisition Plan and signed off<-

Has anyone seen anything like this before? Is this a genuine new bug or 
something of which the Lucene folk (or at least whoever wrote the 
highlighter) are aware? Can anyone think of a way to fix this without 
scanning every element in my result text for rogue spaces?
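
For comparison, the stopgap this question hopes to avoid, scanning the highlighted text for the stray space, might look like the sketch below. It is not a fix for the underlying cause, and it rests on assumptions: the only damage is whitespace between `<` and `em>`, and the highlighted text never legitimately contains that sequence. `EmTagRepair` and `repair` are hypothetical names, not Lucene API.

```java
import java.util.regex.Pattern;

// Hypothetical post-processing step (not part of Lucene).
class EmTagRepair {
    // Matches "<", one or more whitespace characters, then "em>".
    private static final Pattern BROKEN_OPEN_EM = Pattern.compile("<\\s+em>");

    // Returns the text with every damaged opening tag rewritten as "<em>".
    static String repair(String highlighted) {
        return BROKEN_OPEN_EM.matcher(highlighted).replaceAll("<em>");
    }
}
```

Calling `EmTagRepair.repair(...)` on the highlighter output before embedding it in the larger XML document would turn `< em>Plan</em>` back into `<em>Plan</em>` and leave correctly formed tags alone.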

Thanks in advance
Jason.





