pdfbox-users mailing list archives

From Noam Silver <noamsil...@gmail.com>
Subject Re: Can't resolve page number
Date Sun, 10 May 2015 17:58:48 GMT
Thanks Tilman for all your great and fast work.
Unfortunately I can't share the pdf publicly, it's copyrighted.
My code for extracting the text is (simplified):

    public static void main(String[] args) throws IOException {
        PDDocument doc = null;
        boolean hasOutputPath = false;

        if (args.length != 1 && args.length != 2) {
            usage();
            System.exit(0);
        }
        if (args.length == 2) {
            hasOutputPath = true;
        }
        try {
            doc = PDDocument.load(args[0]);
            if (doc.isEncrypted())
            {
                StandardDecryptionMaterial sdm = new StandardDecryptionMaterial("");
                doc.openProtection(sdm);
            }
        }
        catch (IOException e) {
            System.err.println("Error loading PDF file");
            e.printStackTrace();
            System.exit(0);
        }
        catch (BadSecurityHandlerException e) {
            e.printStackTrace();
            System.exit(0);
        }
        catch (CryptographyException e) {
            e.printStackTrace();
            System.exit(0);
        }

        // TextParser is a class of mine that parses the extracted text
        TextParser parser = new TextParser(hasOutputPath ? args[1] : args[0]);

        PDDocumentOutline outlineRoot = doc.getDocumentCatalog().getDocumentOutline();
        PDOutlineItem parentItem = outlineRoot.getFirstChild();

        String parentTitleName;
        String currentChildTitleName;
        String nextChildTitleName;

        PDFTextStripperExt stripper = new PDFTextStripperExt();
        boolean childrenWereParsed = false;

        while (parentItem != null) {
            parentTitleName = parentItem.getTitle();
            if (Pattern.matches(".*Commands", parentTitleName)) {
                PDOutlineItem item = parentItem.getFirstChild();
                while (item != null) {
                    currentChildTitleName = item.getTitle();
                    stripper.setStartBookmark(item);
                    item = item.getNextSibling();
                    if (item == null) {
                        // Last child: the section ends at the next top-level bookmark.
                        // Need to check for null on the next parent item,
                        // but in this pdf's case it won't happen.
                        parentItem = parentItem.getNextSibling();
                        nextChildTitleName = parentItem.getTitle();
                        stripper.setEndBookmark(parentItem);
                    }
                    else {
                        nextChildTitleName = item.getTitle();
                        stripper.setEndBookmark(item);
                    }
                    parser.parseText(stripper.getTextBySpecification(doc),
                            currentChildTitleName, nextChildTitleName);
                    docCount++;
                }
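                // The inner loop has already advanced parentItem to the next sibling
                childrenWereParsed = true;
            }
            if (!childrenWereParsed) {
                parentItem = parentItem.getNextSibling();
            }
            // Reset the flag so later non-matching parents are still skipped
            childrenWereParsed = false;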
        }
    }
(there might be some syntax errors since I simplified the code, but this is
the main concept)
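
The pairing of start and end bookmarks in the loop above can be shown without PDFBox. This is only a minimal sketch: `Bookmark` and `SectionPairs` are hypothetical stand-ins (not PDFBox classes), titles replace the real outline items, and a null end title is assumed to mean "read to the end of the document". Unlike the loop above, it also covers the case where no parent follows the last child.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an outline item; NOT the PDFBox PDOutlineItem class.
final class Bookmark {
    final String title;
    final List<Bookmark> children = new ArrayList<>();

    Bookmark(String title, String... childTitles) {
        this.title = title;
        for (String childTitle : childTitles) {
            children.add(new Bookmark(childTitle));
        }
    }
}

final class SectionPairs {
    // Returns {startTitle, endTitle} pairs for every child of a top-level
    // bookmark whose title ends in "Commands". endTitle is null when no
    // bookmark follows the child (assumed to mean "until end of document").
    static List<String[]> pairs(List<Bookmark> topLevel) {
        List<String[]> result = new ArrayList<>();
        for (int p = 0; p < topLevel.size(); p++) {
            Bookmark parent = topLevel.get(p);
            if (!parent.title.matches(".*Commands")) {
                continue; // only sections under "...Commands" are extracted
            }
            for (int c = 0; c < parent.children.size(); c++) {
                String start = parent.children.get(c).title;
                String end;
                if (c + 1 < parent.children.size()) {
                    end = parent.children.get(c + 1).title; // next sibling
                } else if (p + 1 < topLevel.size()) {
                    end = topLevel.get(p + 1).title;        // next parent
                } else {
                    end = null;                             // nothing follows
                }
                result.add(new String[] {start, end});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Bookmark> outline = new ArrayList<>();
        outline.add(new Bookmark("Introduction"));
        outline.add(new Bookmark("Basic Commands", "open", "close"));
        outline.add(new Bookmark("Appendix"));
        for (String[] pair : pairs(outline)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

Iterating by index instead of reassigning the parent inside the inner loop keeps the outer loop's position independent of what the inner loop did, so no extra flag is needed.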

The code I was talking about, where namesDict =
doc.getDocumentCatalog().getNames() returns null, is part of the PDFBox
code itself: it is in the findDestinationPage method of the PDOutlineItem
class, in the if( rawDest instanceof PDNamedDestination ) section.
It seems that there is an anomaly in this specific pdf. I'll try to load the
pdf with loadNonSeq(file,null) and see what the difference is.
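
One possible explanation, offered as an assumption and not as a diagnosis of this file: the PDF specification allows named destinations to be stored either in the name tree under the catalog's Names dictionary (PDF 1.2 and later) or in a Dests dictionary placed directly in the catalog (PDF 1.1 style). A document that uses only the older form has no names dictionary, yet its bookmarks still work in a viewer. The sketch below models that two-step lookup with plain maps; `NamedDestLookup` and its fields are hypothetical and it does not use the PDFBox API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a document catalog; plain maps, NOT PDFBox objects.
final class NamedDestLookup {
    // Destination name -> 1-based page number, as found via the Names
    // dictionary. Null models a document without a names dictionary.
    private final Map<String, Integer> nameTreeDests;
    // Destination name -> 1-based page number, as found in the legacy
    // catalog-level Dests dictionary. Null models a missing Dests entry.
    private final Map<String, Integer> catalogDests;

    NamedDestLookup(Map<String, Integer> nameTreeDests,
                    Map<String, Integer> catalogDests) {
        this.nameTreeDests = nameTreeDests;
        this.catalogDests = catalogDests;
    }

    // Returns the page number for a named destination, or -1 when the name
    // cannot be resolved in either location.
    int resolvePage(String name) {
        if (nameTreeDests != null && nameTreeDests.containsKey(name)) {
            return nameTreeDests.get(name);
        }
        if (catalogDests != null && catalogDests.containsKey(name)) {
            return catalogDests.get(name);
        }
        return -1;
    }

    public static void main(String[] args) {
        Map<String, Integer> legacy = new HashMap<>();
        legacy.put("chapter1", 12);
        // No names dictionary at all, only the legacy Dests dictionary.
        NamedDestLookup lookup = new NamedDestLookup(null, legacy);
        System.out.println(lookup.resolvePage("chapter1"));
        System.out.println(lookup.resolvePage("missing"));
    }
}
```

A lookup that consults only the first map would return -1 for every bookmark in such a document, which matches the symptom described here.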

Noam



On Sun, May 10, 2015 at 5:37 PM, Tilman Hausherr <THausherr@t-online.de>
wrote:

> On 08.05.2015 at 17:17, noamsilver@gmail.com wrote:
>
>> I’m trying to parse a pdf file that I didn’t create, using PDFBox
>> v1.8.9.
>>
>> My problem is that when I try to getText(doc) from a certain section of
>> the pdf using setStartBookmark(item) and setEndBookmark(item), I get all the
>> text rather than just the text from the specified section.
>>
>> While trying to resolve this I realized that the writeText(doc,
>> outputStream) method always calls the resetEngine() method, which resets all
>> the parameters and deletes the bookmarks I set.
>>
>> So my first question is what is the correct way to get the text from a
>> specified section of the pdf?
>>
>
> I've now hopefully fixed that problem in
> https://issues.apache.org/jira/browse/PDFBOX-2792
> a snapshot version will soon be available here:
>
> https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/1.8.10-SNAPSHOT/
>
>> When I continued to try to resolve this, I created a new class that
>> extends PDFTextStripper and changed the getText() and writeText() methods
>> (also changing their names) so that they don’t call the resetEngine() method
>> while keeping the rest of the functionality (I also had to delete the if
>> (getAddMoreFormatting()) section as the parameters are private; is that a
>> problem?).
>>
>> Now when I call the method I created I have a second problem: while it
>> tries to determine the startBookmarkPageNumber in the processPages method,
>> the getPageNumber method returns -1.
>>
>> When I dug deeper I saw that in findDestinationPage method the rawDest is
>> of type PDNamedDestination.
>>
>> The problem is that when trying to get namesDict =
>> doc.getDocumentCatalog().getNames() it returns null. That means that the
>> names dictionary doesn’t exist. What can be done?
>>
>> Just need to point out that in Acrobat the bookmarks all work.
>>
>
> I tested this on a document with names, and I didn't have that effect with
> 1.8.9, so whatever the problem is, it isn't a general problem, so I need
> the file.
>
> One thing to try is to load the document with loadNonSeq(file,null)
> instead of load().
>
> Tilman