pdfbox-users mailing list archives

From Maruan Sahyoun <sahy...@fileaffairs.de>
Subject Re: Having trouble removing elements from a PDF (part II)
Date Mon, 04 Feb 2013 16:02:33 GMT
Hi,

the typical approach is to create a new PDF from the source, copying over only the elements
you are interested in. PDFMergerUtility gives you some hints on how to do that. Now to the
PDStream question. The contents of a PDF are expressed using basic objects such as strings,
numbers, arrays ... These can be included literally or as part of a kind of byte array,
a PDStream (which can be compressed ...). A stream is typically used for larger "collections"
of PDF objects, or where the object itself contains a large amount of data, e.g. for images.
So in order to handle a stream you need to get at its contents and then make a decision:
either you need to work on individual objects within it, or you can simply copy the entire
stream, contents and all.
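
To make the "get at the contents, then decide" step concrete, here is a minimal sketch that
uses only the JDK, not PDFBox. It assumes a Flate-compressed content stream, decodes it, drops
one text-showing operation, and re-encodes the result. The class name StreamRoundTrip, its
methods, and the sample content are made up for illustration, and the tokenizer is a toy: it
ignores hex strings, TJ arrays, nested parentheses and inline images. For real files, PDFBox's
own content stream parser and writer (PDFStreamParser and ContentStreamWriter in the 1.x line,
as far as I know) are the better tools.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical helper, not part of PDFBox: decode, filter and re-encode a content stream.
final class StreamRoundTrip {

    // Toy token pattern: a literal string "(...)" without nesting, or a run of other characters.
    private static final Pattern TOKEN =
            Pattern.compile("\\((?:\\\\.|[^\\\\()])*\\)|[^\\s()]+");

    // Compress bytes in zlib format, which is what a /FlateDecode stream holds.
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress zlib data; malformed or truncated input becomes an unchecked exception.
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported stream");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid Flate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Drop every "(string) Tj" pair whose string contains the given text; keep all other tokens.
    static String dropText(String content, String needle) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(content);
        while (m.find()) {
            tokens.add(m.group());
        }
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            boolean showsText = t.startsWith("(")
                    && i + 1 < tokens.size() && tokens.get(i + 1).equals("Tj");
            if (showsText && t.contains(needle)) {
                i++; // also skip the Tj operator that consumed the string
                continue;
            }
            kept.add(t);
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // Made-up page content: body text plus a footnote further down the page.
        String page = "BT /F1 12 Tf 72 700 Td (Body text) Tj 0 -600 Td (Footnote 1) Tj ET";
        byte[] stored = deflate(page.getBytes(StandardCharsets.ISO_8859_1));

        String decoded = new String(inflate(stored), StandardCharsets.ISO_8859_1);
        String edited = dropText(decoded, "Footnote");
        byte[] replacement = deflate(edited.getBytes(StandardCharsets.ISO_8859_1));

        // prints: BT /F1 12 Tf 72 700 Td (Body text) Tj 0 -600 Td ET
        System.out.println(edited);
        System.out.println(replacement.length + " bytes in the replacement stream");
    }
}
```

The point of the sketch is that the stream is not edited in place: the decoded tokens are
filtered into a new list, and a new stream is built from that list to stand in for the old one.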

This is a simplified description of the model, so you may want to read the start of the
PDF spec to get a clearer picture of how a PDF is organized. Section 7.3 of the ISO 32000
spec describes the basic objects as well as streams.

Kind regards

Maruan Sahyoun

On 04.02.2013 at 15:56, Dominic Jacobssen <dominic.jacobssen@gmail.com> wrote:

> Hi all,
> 
> (Sorry for the non-threaded follow-up mail; I didn't receive my own
> mail from the list, so I couldn't reply to it and thereby preserve
> the thread).
> 
> After a few more hours of bashing against the problem, I've got the
> following code. The key insight (which had completely eluded me, I'm
> afraid) is that the contents of each page are inside the dictionary
> value corresponding to the "/Contents" key, and that this is a
> compressed stream. (I've also since worked out that
> PDPage.getContents() is a simpler way of getting hold of this).
> 
> I'm successfully finding all textual elements on the page, among which
> the footnote string is visible. So there is hope.
> 
> However, this code is currently "read only": since I'm calling
> PDStream.getStream(), and the page at:
> 
>    http://pdfbox.apache.org/userguide/index.html
> 
> says, "A stream of data, typically compressed. This is used for page
> content.", I presume that I can't modify the stream in place, so I'd
> need to create a new stream and add items to it one by one.
> 
> Is this the right approach? Can I recreate the original page by
> creating a new page, copying across the metadata, then writing objects
> from the "source" page to the "destination" page's stream one by one?
> 
> Many thanks,
> 
> Dominic
> 
> -- Code snippet starts here --
> 
> PDFMergerUtility merger = new PDFMergerUtility();
> 
> for (PDFDocument inputDocument : inputDocuments ) {
>    byte[] buffer = inputDocument.getBytes();
> 
>    ByteArrayInputStream bais = new ByteArrayInputStream ( buffer);
>    PDDocument pdDocument = PDDocument.load ( bais );
> 
>    if (pdDocument.isEncrypted()) {
> 	try {
> 	    DecryptionMaterial dm = new StandardDecryptionMaterial("foo");
> 	    pdDocument.openProtection(dm);
> 	    System.out.println("Successfully decrypted file!");
> 	} catch (CryptographyException e) {
> 	    e.printStackTrace();
> 	} catch (BadSecurityHandlerException e) {
> 	    e.printStackTrace();
> 	}
>    }
> 
>    @SuppressWarnings("unchecked")
>    List<PDPage> allPages = pdDocument.getDocumentCatalog().getAllPages();
>    int nPages = pdDocument.getNumberOfPages();
>    for (PDPage onePage : allPages) {
> 	PDStream contents = onePage.getContents();
> 	COSStream cosStream = contents.getStream();
> 
> 	// This allows me to loop over the page's string content ...
> 	for (Object token : cosStream.getStreamTokens()) {
> 	    if (token instanceof COSString) {
> 		COSString cosString = (COSString) token;
> 		String s = cosString.getString();
> 		System.out.println("COSString: [" + s + "]");
> 	    }
> 	}
> 
> 	// ... but as this is a stream, it's not obvious how to
> 	// modify the stream. Do I need to create a new one?
>    }
> 
>    // Serialise the document once per input document (not once per
>    // page) and hand the bytes to the merger.
>    COSDocument doc = pdDocument.getDocument();
>    ByteArrayOutputStream baos = new ByteArrayOutputStream();
>    COSWriter cosWriter = new COSWriter(baos);
>    try {
> 	cosWriter.write(doc);
>    } catch (COSVisitorException e) {
> 	e.printStackTrace();
>    }
>    // No second variable named "bais" here: it is already declared
>    // above in this scope, and redeclaring it does not compile.
>    merger.addSource(new ByteArrayInputStream(baos.toByteArray()));
> }
> 
> try {
>    merger.setDestinationFileName("cat.pdf");
>    merger.mergeDocuments();
> } catch (COSVisitorException e) {
>    e.printStackTrace();
> }

