pdfbox-users mailing list archives

From Dominic Jacobssen <dominic.jacobs...@gmail.com>
Subject Having trouble removing elements from a PDF.
Date Mon, 04 Feb 2013 12:43:08 GMT
Hi all; first post to the list here.

I'm attempting to use PDFBox to achieve the following.

We have a large database of pre-existing PDF documents that are
currently being programmatically created using iText, but in the
medium-to-long term we'd like to port the generation code to PDFBox.
We *do* control the PDF generation code, so if we need to do something
like "insert a special magic token" into the PDF, we can probably do
so, even though it's not currently PDFBox-based.

One of the tasks we need to do is to retrieve multiple pre-existing
documents from the database, in an order that is not known
ahead-of-time, and concatenate them into one long document. The kicker
is this: All the existing documents contain page numbers, and we want
to renumber the concatenated document (so that, for example, there's
only one page 1).

I've successfully managed to use the PDFMergerUtility class to perform
a concatenation. This was ever so slightly tricky, because the source
PDFs are encrypted and PDFMergerUtility.addSource() requires an
InputStream. I ended up writing each decrypted page to a
ByteArrayOutputStream (via a COSWriter), then wrapping the resulting
bytes in a ByteArrayInputStream for the benefit of addSource(). I'm
sure there are much more elegant ways, but this first-cut approach
worked.

Conceptually, in my (possibly naive) mind, the code structure I'm
aiming for looks like the following pseudocode:

outputDocument = new OutputDocument()
for document in documents:
    for page in document.getPages():
        outputPage = new Page()
        for element in page.getElements():
            if not isDesiredElement(element):
                continue  # skip the old page number
            outputPage.add(element)
        outputDocument.add(outputPage)

or, possibly, some sort of in-place deletion, like this:

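outputDocument = new OutputDocument()
for document in documents:
    for page in document.getPages():
        elementMap = page.getElementMap()
        # iterate over a copy so that deleting entries is safe
        for k, v in list(elementMap.items()):
            if not isDesiredElement(k, v):
                del elementMap[k]
        outputDocument.add(page)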

However, I'm still stumped, after a day and a half, as to how to go
about finding the "page number element" on each page and suppressing
it. I've been pulling out COSDictionaries and PDStreams, and I'm
afraid I'm none the wiser. Specifically, the generator inserts
elements on the page in terms of "Phrases" and "Chunks"; I initially
believed that these were PDF concepts, and that there would be a
similar API on the receiving end to help identify such structure, but
it now appears that these are library-specific, and aren't "universal"
PDF concepts.

I'm starting to suspect that I might have to see what PDF commands the
source footnote stringifies to, and use some sort of evil regexp on
the PDF stream, but this feels both wrong and fragile.
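
For concreteness, a rough sketch of that regexp idea might look like
the following. It uses only java.util.regex and no PDFBox calls, and
it rests on several assumptions that may not hold for real iText
output: the page content stream has already been decoded to plain
text, the page number is drawn as a single literal string with the Tj
operator inside its own BT ... ET text object, and the label has the
form "Page N". The class name, method name and label format are all
hypothetical. Output that uses TJ arrays, hex strings or subset-encoded
fonts would not match, which is exactly the fragility mentioned above.

```java
import java.util.regex.Pattern;

final class PageNumberStripper {

    // Hypothetical label format; adjust to whatever the generator emits.
    private static final String PAGE_LABEL = "Page \\d+";

    // Matches one whole text object (BT ... ET) that shows the label.
    // The "(?:(?!\bET\b).)*?" parts consume characters lazily but refuse
    // to step over an ET operator, so a match never spans two text objects.
    private static final Pattern PAGE_NUMBER_TEXT_OBJECT = Pattern.compile(
            "\\bBT\\b"
            + "(?:(?!\\bET\\b).)*?"
            + "\\(" + PAGE_LABEL + "\\)\\s*Tj"
            + "(?:(?!\\bET\\b).)*?"
            + "\\bET\\b\\s*",
            Pattern.DOTALL);

    // Returns the content stream with every matching text object removed.
    // Assumes the stream is already decoded (not Flate-compressed).
    static String strip(String decodedContentStream) {
        return PAGE_NUMBER_TEXT_OBJECT.matcher(decodedContentStream).replaceAll("");
    }
}
```

Calling PageNumberStripper.strip() on a decoded stream leaves any text
object that does not show the label untouched. If the generator could
emit a unique marker string instead of "Page N", PAGE_LABEL could be
narrowed to that marker, which would reduce the risk of false matches.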

Can any kind soul maybe nudge me in the right direction?


