lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sonja Löhr <>
Subject pdf and highlighting
Date Thu, 08 Dec 2005 09:24:38 GMT

Hi, all!

I have a question concerning analysis and highlighting. I'm indexing
multiple document formats (up to now, only html and pdf occured, and use the
highlighter from the Lucene sandbox.
The documents text is extracted via JTidy and PDFBox, respectively, then in
both indexing and search analysed with a custom subclass of GermanAnalyzer,
which replaces character references and html entities.

Now the funny thing is that, even if I store the body text, highlighter uses
wrong positions with lucene Docs stemming from pdf documents, whereas html
is hightlighted correctly.  I really don't have an explanation for this
behaviour - for doc.get("body") is simply text, in either case, and stop
words were also removed in ALL kinds of documents (and through an instance
of the same analyzer passed to QueryParser.

Are there any hints to where I could find my error - or did anybody else
encounter the same problem?

Thanks in advance!


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message