pdfbox-users mailing list archives

From <noamsil...@gmail.com>
Subject Can't resolve page number
Date Fri, 08 May 2015 15:17:55 GMT




Hello,

I’m trying to parse a PDF file that I didn’t create, using PDFBox v1.8.9.

My problem is that when I call getText(doc) to extract a certain section of the PDF, after calling
setStartBookmark(item) and setEndBookmark(item), I get all the text rather than just the text
from the specified section.

While trying to resolve this I realized that the writeText(doc, outputStream) method always
calls the resetEngine() method, which resets all the parameters and deletes the bookmarks I
set.

So my first question is: what is the correct way to get the text from a specified section of
the PDF?

While continuing to work on this I created a new class that extends PDFTextStripper,
with copies of the getText() and writeText() methods (under new names) that don’t
call resetEngine() but keep the rest of the functionality. I also
had to delete the if (getAddMoreFormatting()) section because the parameters are private. Is that
a problem?
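
To make the call order concrete, here is a toy model of the behaviour described above. It is plain Java, not PDFBox code: ToyStripper, its String bookmarks, and getTextKeepingRange are made-up stand-ins, and the only thing it assumes about the real class is what is reported above, namely that the reset runs before the range is read.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model only: NOT PDFTextStripper. It mimics the call order described above. */
final class ToyStripper {
    private String startBookmark; // stand-in for the item passed to setStartBookmark
    private String endBookmark;   // stand-in for the item passed to setEndBookmark

    void setStartBookmark(String title) { startBookmark = title; }
    void setEndBookmark(String title) { endBookmark = title; }

    /** Clears the range, as resetEngine() is reported to do. */
    void resetEngine() {
        startBookmark = null;
        endBookmark = null;
    }

    /** Resets first, so a range set beforehand is gone by the time it is read. */
    String getText(List<String> sections) {
        resetEngine();
        return extract(sections);
    }

    /** Variant without the reset, like the renamed methods in the subclass. */
    String getTextKeepingRange(List<String> sections) {
        return extract(sections);
    }

    // Assumes any bookmark title that is set also appears in the list.
    private String extract(List<String> sections) {
        int from = startBookmark == null ? 0 : sections.indexOf(startBookmark);
        int to = endBookmark == null ? sections.size() - 1 : sections.indexOf(endBookmark);
        return String.join(" ", sections.subList(from, to + 1));
    }

    public static void main(String[] args) {
        List<String> sections = Arrays.asList("intro", "methods", "results");
        ToyStripper stripper = new ToyStripper();

        stripper.setStartBookmark("methods");
        stripper.setEndBookmark("methods");
        System.out.println(stripper.getText(sections)); // prints "intro methods results"

        stripper.setStartBookmark("methods");
        stripper.setEndBookmark("methods");
        System.out.println(stripper.getTextKeepingRange(sections)); // prints "methods"
    }
}
```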

Now when I call the method I created I hit a second problem: while the processPages method tries
to determine startBookmarkPageNumber, the getPageNumber method returns -1.

When I dug deeper I saw that in the findDestinationPage method rawDest is of type PDNamedDestination.

The problem is that namesDict = doc.getDocumentCatalog().getNames() returns null, which
means that the names dictionary doesn’t exist. What can be done?
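
A toy model of that lookup, under an assumption that has not been checked against this file: the PDF specification allows named destinations to live either in the name tree under /Names (PDF 1.2 and later) or in an older /Dests dictionary directly in the catalog (PDF 1.1), so a file may have no names dictionary and still have working named destinations. The sketch is plain Java with maps standing in for PDF dictionaries. It is not PDFBox code, and every class and method name in it is hypothetical.

```java
import java.util.Collections;
import java.util.Map;

/** Toy model only: NOT PDFBox code. Maps stand in for PDF dictionaries. */
final class NamedDestinationModel {

    /** Hypothetical stand-in for the document catalog; either map may be null. */
    static final class Catalog {
        final Map<String, Integer> nameTreeDests; // destinations under /Names (PDF 1.2 and later)
        final Map<String, Integer> catalogDests;  // /Dests directly in the catalog (PDF 1.1 style)

        Catalog(Map<String, Integer> nameTreeDests, Map<String, Integer> catalogDests) {
            this.nameTreeDests = nameTreeDests;
            this.catalogDests = catalogDests;
        }
    }

    /** Returns the page index for a named destination, or -1 if neither location has it. */
    static int resolvePage(Catalog catalog, String name) {
        Integer page = null;
        if (catalog.nameTreeDests != null) {
            page = catalog.nameTreeDests.get(name);
        }
        if (page == null && catalog.catalogDests != null) {
            page = catalog.catalogDests.get(name); // fall back to the older location
        }
        return page == null ? -1 : page;
    }

    public static void main(String[] args) {
        // A file with no /Names entry at all, only the older /Dests dictionary.
        Catalog oldStyle = new Catalog(null, Collections.singletonMap("chapter2", 4));
        System.out.println(resolvePage(oldStyle, "chapter2")); // prints 4
        System.out.println(resolvePage(oldStyle, "missing"));  // prints -1
    }
}
```

A lookup that consults only the first map returns -1 for such a file, which would match the symptom above if the assumption holds.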

I should point out that all the bookmarks work in Acrobat.


Noam