lucene-java-user mailing list archives

From Jason <>
Subject hithighlighter bug
Date Wed, 10 Jan 2007 02:34:54 GMT
Hi all,
I have come across what I think is a curious but insidious bug in the
Java Lucene hit highlighter. I first found this problem in Lucene v1.4,
so I updated to the latest versions of Lucene and the highlighter;
unfortunately it's still there in v2.0.0.

I am indexing XML documents and am also using the hit highlighter for 
search results. This works perfectly in almost every case except for one.

in my I have this:

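public class LuceneSearch implements Formatter { // interface name missing from the post; Formatter assumed
	public String highlightTerm(String originalText, TokenGroup group) {
		// guard condition missing from the post; a zero-score check is assumed
		if (group.getTotalScore() <= 0)
			return originalText;
		return "<em>" + originalText + "</em>";
	}
}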

When I search for -> Acquisition Plan <-
my search results contain:
<summary>(ancillary stuff deleted)....
attached to the <em>Acquisition</em>
< em>Plan</em>and signed</summary>

Notice the space between the < and the e in the second < em>.
This only occurs for these search terms and for this document (as far as
I know), but because it's part of a much larger XML document it breaks
the whole thing.

The original XML is unremarkable, with no strange characters surrounding
these terms. Here is a snippet from the paragraph the highlighted terms
come from:

-> attached to the Acquisition Plan and signed off<-

Has anyone seen anything like this before? Is this a genuine new bug, or
something the Lucene folk (or at least whoever wrote the highlighter)
are already aware of? Can anyone think of a way to fix this without
scanning every element in my result text for rogue spaces?
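
For reference, the fallback I would like to avoid looks roughly like
this. It is only a minimal sketch, assuming the only damage is stray
whitespace inside the <em> and </em> tags; the class name EmTagRepair
and the method repair are made up for illustration and are not part of
Lucene or the highlighter:

```java
import java.util.regex.Pattern;

// Hypothetical helper (not a Lucene class): normalises whitespace inside
// <em> and </em> tags in already-highlighted text.
final class EmTagRepair {
    // Matches "<em>", "< em>", "</em>", "</ em >" and similar variants.
    private static final Pattern EM_TAG =
            Pattern.compile("<\\s*(/?)\\s*em\\s*>");

    private EmTagRepair() {}

    // Rewrites every matched tag to its canonical form; well-formed tags
    // are matched too but come out unchanged.
    static String repair(String highlighted) {
        return EM_TAG.matcher(highlighted).replaceAll("<$1em>");
    }
}
```

Calling EmTagRepair.repair(...) on each highlighted fragment before it
goes into the result XML would hide the symptom, but it does not explain
where the space comes from, and it relies on any literal < in the text
being escaped as &lt; already.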

Thanks in advance
