lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: pdf and highlighting
Date Thu, 08 Dec 2005 20:07:38 GMT

On Dec 8, 2005, at 10:51 AM, Sonja Löhr wrote:
> Thank you both, I found it
> (I really asked a bit too early, sorry)
>
> The highlighter works correctly if I use my custom Analyzer during
> indexing (and for QueryParser), BUT when preparing the TokenStream
> to feed the highlighter, I must NOT use it.
>
> TokenStream tStream = new GermanAnalyzer().tokenStream("body",
>     new StringReader(bodyText));
> System.out.println(highlighter.getBestFragments(tStream, bodyText,
>     4, " ..... "));
>
> works, whereas
>
> TokenStream tStream = new GermanHtmlAnalyzer().tokenStream("body",
>     new StringReader(bodyText));
> System.out.println(highlighter.getBestFragments(tStream, bodyText,
>     4, " ..... "));
>
> gives rubbish highlighting.
>
> GermanHtmlAnalyzer feeds a normal GermanAnalyzer with a shortened
> String (native characters) if the input contains decimal or HTML
> entities, but then I'm totally confused why there is a problem with
> PDF text and not with HTML text...

The likely reason is that the token offsets fed to the highlighter
don't match the character positions in the text you're highlighting.
Your analyzer generates offsets against the string after the entities
have been replaced (so it is likely a different length), but you are
highlighting the original text with the entities left intact.

Maybe??
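
To make the suspected mismatch concrete, here is a minimal sketch in
plain Java, with no Lucene involved. The class name, the decodeEntities
helper, and the sample sentence are all made up for illustration;
decodeEntities only stands in for whatever replacement
GermanHtmlAnalyzer performs. It computes start/end offsets on the
decoded string, the way the analyzer would, and then applies them to
both the decoded and the original string.

```java
public class OffsetMismatchDemo {

    // Hypothetical stand-in for the entity replacement done before analysis.
    // It handles only two entities, which is enough for the demonstration.
    static String decodeEntities(String text) {
        return text.replace("&auml;", "\u00e4").replace("&uuml;", "\u00fc");
    }

    // Wraps text[start, end) in <B> tags, roughly what a highlighter does
    // with the start and end offsets of a matching token.
    static String markUp(String text, int start, int end) {
        return text.substring(0, start)
                + "<B>" + text.substring(start, end) + "</B>"
                + text.substring(end);
    }

    public static void main(String[] args) {
        String original = "M&uuml;nchen liegt in Bayern";
        String decoded = decodeEntities(original);

        // Offsets of the matching term, measured in the DECODED string.
        String term = "Bayern";
        int start = decoded.indexOf(term);
        int end = start + term.length();

        // Offsets and text agree: the term itself is wrapped.
        System.out.println(markUp(decoded, start, end));

        // Offsets from the decoded string applied to the original string:
        // "&uuml;" is five characters longer than the single character it
        // decodes to, so the marked span lands five characters too early.
        System.out.println(markUp(original, start, end));
        // second line printed: M&uuml;nchen lieg<B>t in B</B>ayern
    }
}
```

If this is what is happening, the fix is to build the TokenStream from
exactly the same string you pass to getBestFragments: either analyze
the original text without the replacement step (as your working
GermanAnalyzer variant does), or decode first and highlight the
decoded text.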

	Erik

