lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Renaud Waldura" <>
Subject PDF Highlighting Again
Date Thu, 09 Nov 2006 18:55:41 GMT
I read the mailing-list archives about this topic and found the PDFBox
solutions at:

Basically there are 3 options:
1- append query parameters to the PDF URL
2- generate a highlight XML document that Acrobat Reader will download
3- alter contents of the PDF

I've only tried (1) so far and I really like the user experience: Acrobat
Reader displays a list of matches in a small pane on the right. By clicking
in this pane the user can bounce around the document. I don't think (2) or
(3) can match that.

I implemented (1) by rewriting the Lucene query and creating a query string
with the terms. I am now facing a problem where the number of terms expand
beyond the limits of the query string.

I realized it's silly to ask Acrobat to highlight all the query terms when
in reality only a subset of them is present in the document. But how can I
compute that subset?

I'm thinking I might have to tokenize the document text (I have it), then
compute the intersection between the set of all terms and the set of terms
from the rewritten query. Blech. Sounds expensive. Any other ideas?

How does the contrib'ed highlighter work? I'm guessing it's doing something
similar. Any way I could not tokenize the document twice (they can be really


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message