lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: pdf and highlighting
Date Thu, 08 Dec 2005 09:58:35 GMT
Sonja,

Do you have an example, or at least some relevant code, that would  
help the community in helping resolve this?

	Erik

On Dec 8, 2005, at 4:24 AM, Sonja Löhr wrote:

>
> Hi, all!
>
> I have a question concerning analysis and highlighting. I'm indexing
> multiple document formats (up to now, only html and pdf occured,  
> and use the
> highlighter from the Lucene sandbox.
> The documents text is extracted via JTidy and PDFBox, respectively,  
> then in
> both indexing and search analysed with a custom subclass of  
> GermanAnalyzer,
> which replaces character references and html entities.
>
> Now the funny thing is that, even if I store the body text,  
> highlighter uses
> wrong positions with lucene Docs stemming from pdf documents,  
> whereas html
> is hightlighted correctly.  I really don't have an explanation for  
> this
> behaviour - for doc.get("body") is simply text, in either case, and  
> stop
> words were also removed in ALL kinds of documents (and through an  
> instance
> of the same analyzer passed to QueryParser.
>
> Are there any hints to where I could find my error - or did anybody  
> else
> encounter the same problem?
>
> Thanks in advance!
>
> sonja
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message