pdfbox-users mailing list archives

From: Noam Silver <noamsil...@gmail.com>
Subject: Re: Can't resolve page number
Date: Sun, 10 May 2015 17:58:48 GMT

Thanks Tilman for all your great and fast work.
Unfortunately I can't share the pdf publicly because it's copyrighted.
My code for extracting the text is (simplified):

    public static void main(String[] args) throws IOException {
        PDDocument doc = null;
        boolean hasOutputPath = false;

        if (args.length != 1 && args.length != 2) {
            // ...
        }
        if (args.length == 2) {
            hasOutputPath = true;
        }
        try {
            doc = PDDocument.load(args[0]);
            if (doc.isEncrypted()) {
                StandardDecryptionMaterial sdm = new ...
            }
        } catch (IOException e) {
            System.err.println("Error loading PDF file");
        } catch (BadSecurityHandlerException e) {
            // ...
        } catch (CryptographyException e) {
            // ...
        }

        // TextParser is a class of mine to parse the text received
        TextParser parser = new TextParser(hasOutputPath ? args[1] : args[0]);

        PDDocumentOutline outlineRoot = ...
        PDOutlineItem parentItem = outlineRoot.getFirstChild();

        String parentTitleName;
        String currentChildTitleName;
        String nextChildTitleName;

        PDFTextStripperExt stripper = new PDFTextStripperExt();
        boolean childrenWereParsed = false;

        while (parentItem != null) {
            parentTitleName = parentItem.getTitle();
            if (Pattern.matches(".*Commands", parentTitleName)) {
                PDOutlineItem item = parentItem.getFirstChild();
                while (item != null) {
                    currentChildTitleName = item.getTitle();
                    if ((item = item.getNextSibling()) == null) {
                        // Need to check null on the next parent item,
                        // but in this pdf's case it won't happen.
                        nextChildTitleName = (parentItem = parentItem.getNextSibling()).getTitle();
                    } else {
                        nextChildTitleName = item.getTitle();
                    }
                    // ... currentChildTitleName, nextChildTitleName);
                }
                childrenWereParsed = true;
            }
            if (!childrenWereParsed) {
                parentItem = parentItem.getNextSibling();
            }
        }
    }
(there might be some syntax errors since I simplified the code, but this is
the main concept)
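
The traversal above pairs each child bookmark with the title that follows it, and notes that the next parent item should really be null-checked. A minimal stdlib-only sketch of that pairing with the null check in place is shown here. OutlineNode and OutlinePairs are hypothetical stand-ins for illustration; they are not PDFBox classes, and the real code would walk PDOutlineItem objects instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for PDOutlineItem: a title, a first child, a next sibling. */
final class OutlineNode {
    final String title;
    OutlineNode firstChild;
    OutlineNode nextSibling;

    OutlineNode(String title) {
        this.title = title;
    }
}

final class OutlinePairs {
    /**
     * For every child of a top-level node whose title matches ".*Commands",
     * returns {currentTitle, nextTitle}. The next title is the following
     * sibling's title, or the next top-level node's title for the last child,
     * or null when nothing follows at all.
     */
    static List<String[]> sectionBounds(OutlineNode firstTopLevel) {
        List<String[]> pairs = new ArrayList<>();
        for (OutlineNode parent = firstTopLevel; parent != null; parent = parent.nextSibling) {
            if (!Pattern.matches(".*Commands", parent.title)) {
                continue; // not a section we want to extract
            }
            for (OutlineNode child = parent.firstChild; child != null; child = child.nextSibling) {
                // Fall back to the next parent only when the child is the last one.
                OutlineNode next = child.nextSibling != null ? child.nextSibling : parent.nextSibling;
                pairs.add(new String[] { child.title, next == null ? null : next.title });
            }
        }
        return pairs;
    }
}
```

For a "Basic Commands" node with children "open" and "close", followed by a top-level "Appendix" node, sectionBounds yields the pairs (open, close) and (close, Appendix). Using plain for loops also avoids the separate childrenWereParsed flag, since the parent always advances exactly once per iteration.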

The code I was talking about, where *namesDict* is *null*, is part of the
pdfbox code: the *findDestinationPage* method of the *PDOutlineItem* class,
in the *if( rawDest instanceof PDNamedDestination )* branch.
It seems that there is an anomaly in this specific pdf. I'll try to load the
pdf with *loadNonSeq(file,null)* and see what the difference is.
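
One possible explanation, offered as an assumption rather than a diagnosis of this file: the PDF specification allows named destinations to live either in the Dests name tree of the name dictionary (PDF 1.2 and later) or in a Dests dictionary placed directly in the document catalog (PDF 1.1). A file that uses only the older form has no names dictionary, so a lookup that consults getNames() alone finds nothing. The sketch below models that two-step lookup with plain Java maps. NamedDestLookup, resolvePage and both map parameters are hypothetical stand-ins, not PDFBox API.

```java
import java.util.Map;

final class NamedDestLookup {
    /**
     * Resolves a named destination to a page index.
     *
     * nameTreeDests stands in for the Dests name tree under the catalog's
     * Names entry; legacyDests stands in for a Dests dictionary stored
     * directly in the catalog. Either map may be null when the document
     * lacks that structure. Returns -1 when the name cannot be resolved.
     */
    static int resolvePage(String name,
                           Map<String, Integer> nameTreeDests,
                           Map<String, Integer> legacyDests) {
        // Newer form first: the name tree in the names dictionary.
        if (nameTreeDests != null) {
            Integer page = nameTreeDests.get(name);
            if (page != null) {
                return page;
            }
        }
        // Fallback: the older catalog-level Dests dictionary.
        if (legacyDests != null) {
            Integer page = legacyDests.get(name);
            if (page != null) {
                return page;
            }
        }
        return -1; // unresolved, mirroring the -1 seen from getPageNumber
    }
}
```

With a null name tree and a legacy map containing the requested name, resolvePage returns the page from the legacy map instead of -1. Whether this particular pdf stores its destinations that way can only be confirmed by inspecting its catalog.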


On Sun, May 10, 2015 at 5:37 PM, Tilman Hausherr <THausherr@t-online.de> wrote:

> On 08.05.2015 at 17:17, noamsilver@gmail.com wrote:
>> I’m trying to parse a pdf file that I haven’t created, using PDFBox
>> v1.8.9.
>> My problem is that when I call getText(doc) to get a certain section of
>> the pdf, using setStartBookmark(item) and setEndBookmark(item), I get all the
>> text rather than just the text from the specified section.
>> While trying to resolve this I realized that the writeText(doc,
>> outputStream) method always calls the resetEngine() method. That resets all
>> the parameters and deletes the bookmarks I set.
>> So my first question is: what is the correct way to get the text from a
>> specified section of the pdf?
> I've now hopefully fixed that problem in
> https://issues.apache.org/jira/browse/PDFBOX-2792
> a snapshot version will soon be available here:
> https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/1.8.10-SNAPSHOT/
>> When I continued to try and resolve this I created a new class that
>> extends PDFTextStripper and I changed the getText() and writeText() methods
>> (also changing their names) so that they won’t call the resetEngine() method
>> while keeping the rest of the functionality (I also had to delete the if
>> (getAddMoreFormatting()) section as the parameters are private, is that a
>> problem?).
>> Now when I call the method I created I have a second problem: while it
>> tries to determine the startBookmarkPageNumber in the processPages method,
>> the getPageNumber method returns -1.
>> When I dug deeper I saw that in the findDestinationPage method the rawDest is
>> of type PDNamedDestination.
>> The problem is that namesDict =
>> doc.getDocumentCatalog().getNames() returns null. That means that the
>> names dictionary doesn’t exist. What can be done?
>> Just need to point out that in Acrobat the bookmarks all work.
> I tested this on a document with names, and I didn't have that effect with
> 1.8.9, so whatever the problem is, it isn't a general problem, so I need
> the file.
> One thing to try is to load the document with loadNonSeq(file,null)
> instead of load().
> Tilman
