pdfbox-users mailing list archives

From Dominic Jacobssen <dominic.jacobs...@gmail.com>
Subject Having trouble removing elements from a PDF (part II)
Date Mon, 04 Feb 2013 14:56:14 GMT
Hi all,

(Sorry for the non-threaded follow-up; I never received my own message
from the list, so I couldn't reply to it and preserve the thread.)

After a few more hours of bashing against the problem, I've got the
following code. The key insight (which had completely eluded me, I'm
afraid) is that the contents of each page are inside the dictionary
value corresponding to the "/Contents" key, and that this is a
compressed stream. (I've also since worked out that
PDPage.getContents() is a simpler way of getting hold of this).

I'm successfully finding all textual elements on the page, among which
the footnote string is visible. So there is hope.
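
As a rough illustration of why the strings only show up once the stream
has been decoded: page content is commonly stored with the /FlateDecode
filter, which is zlib/deflate compression. The sketch below uses only
java.util.zip rather than PDFBox, and the class name FlateStreamDemo and
the sample content bytes are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical demo class; not part of PDFBox.
class FlateStreamDemo {

    // Compress raw bytes with zlib/deflate, as a /FlateDecode stream would store them.
    static byte[] deflate(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos = new DeflaterOutputStream(out)) {
            dos.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decompress zlib/deflate bytes back into the readable content stream.
    static byte[] inflate(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InflaterInputStream in =
                 new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up page content: one body line and one footnote line.
        String content = "BT /F1 12 Tf 72 720 Td (Body text) Tj ET\n"
                       + "BT /F1 8 Tf 72 40 Td (Footnote 1) Tj ET\n";
        byte[] stored = deflate(content.getBytes(StandardCharsets.ISO_8859_1));
        String decoded = new String(inflate(stored), StandardCharsets.ISO_8859_1);
        // The footnote operand is readable again only after inflating.
        System.out.println(decoded);
    }
}
```

In PDFBox itself the decoding is done by the library, which is presumably
why getStreamTokens() can hand back readable COSString objects.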

However, this code is currently "read only". I'm calling
PDStream.getStream(), and the page at:

    http://pdfbox.apache.org/userguide/index.html

says, "A stream of data, typically compressed. This is used for page
content." I therefore presume that I can't modify the stream in place,
and that I'd need to create a new stream and add items to it one by
one.

Is this the right approach? Can I recreate the original page by
creating a new page, copying across the metadata, then writing objects
from the "source" page to the "destination" page's stream one by one?
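
A minimal sketch of the "copy one by one, skipping what you don't want"
idea, as a toy in-memory model rather than real PDFBox code. It assumes
the content stream has already been split into tokens, that operands
precede their operator, and that the footnote is drawn by a single Tj
operator. The class TokenFilterDemo, the method dropTextShowing and the
reduced operator list are all invented for the example. PDFBox's
PDFStreamParser and ContentStreamWriter classes look like the natural
place to do the equivalent on real token objects, though that is not
tried here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; models content-stream tokens as plain strings.
class TokenFilterDemo {

    // Deliberately tiny operator set, enough for the sample below.
    private static final List<String> OPERATORS =
        Arrays.asList("BT", "ET", "Tf", "Td", "Tj");

    // Copy tokens to a new list, leaving out any Tj whose operand is the unwanted string.
    static List<String> dropTextShowing(List<String> tokens, String unwanted) {
        List<String> result = new ArrayList<>();
        List<String> operands = new ArrayList<>();
        for (String token : tokens) {
            if (!OPERATORS.contains(token)) {
                // Operands are buffered until their operator arrives.
                operands.add(token);
                continue;
            }
            boolean drop = token.equals("Tj")
                && operands.contains("(" + unwanted + ")");
            if (!drop) {
                result.addAll(operands);
                result.add(token);
            }
            operands.clear();
        }
        // Keep any trailing operands that never met an operator.
        result.addAll(operands);
        return result;
    }

    public static void main(String[] args) {
        List<String> page = Arrays.asList(
            "BT", "/F1", "8", "Tf", "72", "40", "Td", "(Footnote 1)", "Tj", "ET");
        System.out.println(String.join(" ", dropTextShowing(page, "Footnote 1")));
    }
}
```

With the sample tokens above, main prints "BT /F1 8 Tf 72 40 Td ET". Real
pages are messier: text is often shown with TJ arrays, and the string
bytes depend on the font encoding, so a plain string comparison may not
match.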

Many thanks,

Dominic

-- Code snippet starts here --

PDFMergerUtility merger = new PDFMergerUtility();

for (PDFDocument inputDocument : inputDocuments) {
    byte[] buffer = inputDocument.getBytes();

    ByteArrayInputStream bais = new ByteArrayInputStream(buffer);
    PDDocument pdDocument = PDDocument.load(bais);

    if (pdDocument.isEncrypted()) {
	try {
	    DecryptionMaterial dm = new StandardDecryptionMaterial("foo");
	    pdDocument.openProtection(dm);
	    System.out.println("Successfully decrypted file!");
	} catch (CryptographyException e) {
	    e.printStackTrace();
	} catch (BadSecurityHandlerException e) {
	    e.printStackTrace();
	}
    }

    @SuppressWarnings("unchecked")
    List<PDPage> allPages = pdDocument.getDocumentCatalog().getAllPages();
    int nPages = pdDocument.getNumberOfPages();
    for (PDPage onePage : allPages) {
	PDStream contents = onePage.getContents();
	COSStream cosStream = contents.getStream();

	// This allows me to loop over the page's string content ...
	for (Object token : cosStream.getStreamTokens()) {
	    if (token instanceof COSString) {
		COSString cosString = (COSString) token;
		String s = cosString.getString();
		System.out.println("COSString: [" + s + "]");
	    }
	}

	// ... but as this is a stream, it's not obvious how to
	// modify the stream. Do I need to create a new one?
    }

    // Serialise the whole document once (not once per page) and
    // queue it for merging.
    COSDocument doc = pdDocument.getDocument();
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    COSWriter cosWriter = new COSWriter(baos);
    try {
	cosWriter.write(doc);
    } catch (COSVisitorException e) {
	e.printStackTrace();
    }
    // Built inline so it does not clash with the "bais" declared above.
    merger.addSource(new ByteArrayInputStream(baos.toByteArray()));
}

try {
    merger.setDestinationFileName("cat.pdf");
    merger.mergeDocuments();
} catch (COSVisitorException e) {
    e.printStackTrace();
}
