pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Can't resolve page number
Date Sat, 09 May 2015 21:50:15 GMT
On 08.05.2015 at 17:17, noamsilver@gmail.com wrote:
> Hello,
> I’m trying to parse a PDF file that I didn’t create, using PDFBox v1.8.9.
> My problem is that when I call getText(doc) to get a certain section of the PDF, using
> setStartBookmark(item) and setEndBookmark(item), I get all the text rather than just the
> text from the specified section.
> While trying to resolve this I realized that the writeText(doc, outputStream) method
> always calls the resetEngine() method. That resets all the parameters and deletes the
> bookmarks I set.

That seems like a bug to me :-(

> So my first question is: what is the correct way to get the text from a specified section
> of the PDF?

To get it to work, I suggest you get the page number from the 
bookmarks... oops, that is what you tried:

> When I continued to try to resolve this, I created a new class that extends PDFTextStripper
> and changed the getText() and writeText() methods (also changing their names) so that they
> don’t call the resetEngine() method while keeping the rest of the functionality (I also
> had to delete the if (getAddMoreFormatting()) section, as the parameters are private; is
> that a problem?).
> Now when I call the method I created I have a second problem: while it tries to determine
> the startBookmarkPageNumber in the processPages method, the getPageNumber method returns -1.
> When I dug deeper I saw that in the findDestinationPage method the rawDest is of type
> PDNamedDestination.
> The problem is that namesDict = doc.getDocumentCatalog().getNames() returns null. That
> means that the names dictionary doesn’t exist. What can be done?
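
For context on the lookup described above: the PDF specification allows named destinations to be stored in two places, either in the name tree under the catalog's /Names dictionary (PDF 1.2 and later) or in an older /Dests dictionary directly in the catalog (PDF 1.1). Whether this particular file uses the older form is not confirmed in this thread; it is only one possible explanation for a null names dictionary alongside bookmarks that work in Acrobat. The sketch below models a lookup that tries both locations. It is plain Java and does not use PDFBox: the two maps are hypothetical stand-ins for the two PDF structures, and the class and method names are made up for illustration.

```java
// Minimal model of a two-step named-destination lookup.
// Assumption: both maps are hypothetical stand-ins (name -> 1-based page number)
// for the /Names -> /Dests name tree and the catalog-level /Dests dictionary.
final class NamedDestinationLookup {
    static final int NOT_FOUND = -1;

    static int findPage(String name,
                        java.util.Map<String, Integer> nameTreeDests,
                        java.util.Map<String, Integer> legacyDests) {
        // First choice: the name tree, which may be absent (null) in some files.
        if (nameTreeDests != null) {
            Integer page = nameTreeDests.get(name);
            if (page != null) {
                return page;
            }
        }
        // Fallback: the older catalog-level dictionary, which may also be absent.
        if (legacyDests != null) {
            Integer page = legacyDests.get(name);
            if (page != null) {
                return page;
            }
        }
        // Neither location knows the name, mirroring the -1 seen in the report.
        return NOT_FOUND;
    }

    public static void main(String[] args) {
        java.util.Map<String, Integer> legacy = new java.util.HashMap<>();
        legacy.put("chapter2", 4);
        // No name tree at all (null), so the fallback dictionary is consulted.
        System.out.println(findPage("chapter2", null, legacy));
    }
}
```

A lookup that stops after the first step would report "not found" for every bookmark in a file that only has the older dictionary, which matches the symptom described, though again that cause is an assumption here.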

Could you upload the document to a public place? I'll research what is 
going on. Some code would be nice too, i.e. what you have tried so far to 
(not) get the page number.
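
Once the bookmark pages can be resolved, the page-number approach suggested above could look roughly like the following sketch. It only computes the page range; the result would presumably be passed to PDFTextStripper's setStartPage(int) and setEndPage(int), which take 1-based page numbers. This assumes that the page range, unlike the bookmarks, survives the resetEngine() call. The class name, method name, and the -1 convention for an unresolved bookmark are illustrative and not part of PDFBox.

```java
// Sketch: turn resolved bookmark pages into an inclusive, 1-based page range.
// Assumption: pages are already 1-based and -1 means "could not be resolved".
final class BookmarkPageRange {
    static final int UNRESOLVED = -1;

    /**
     * startPage:  page of the bookmark that opens the section.
     * nextPage:   page of the following bookmark, or UNRESOLVED if there is none.
     * totalPages: number of pages in the document.
     * Returns {firstPage, lastPage}, both inclusive.
     */
    static int[] pageRange(int startPage, int nextPage, int totalPages) {
        if (totalPages < 1) {
            throw new IllegalArgumentException("document has no pages");
        }
        if (startPage == UNRESOLVED) {
            throw new IllegalStateException("start bookmark could not be resolved to a page");
        }
        if (startPage < 1 || startPage > totalPages) {
            throw new IllegalArgumentException("start page out of range: " + startPage);
        }
        // Without a following bookmark the section runs to the end of the document.
        int endPage = (nextPage == UNRESOLVED) ? totalPages : nextPage;
        // Keep the range well-formed even if the bookmarks are out of order.
        if (endPage < startPage) {
            endPage = startPage;
        }
        if (endPage > totalPages) {
            endPage = totalPages;
        }
        return new int[] { startPage, endPage };
    }

    public static void main(String[] args) {
        int[] range = pageRange(3, 7, 10);
        System.out.println("setStartPage(" + range[0] + "), setEndPage(" + range[1] + ")");
    }
}
```

The end page is taken to be the page of the following bookmark, inclusive, because a section may end partway down the page on which the next one begins. The result is therefore page-granular and can include some text from the neighbouring sections.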


> Just need to point out that in Acrobat the bookmarks all work.
> Noam
