pdfbox-users mailing list archives

From Jeremias Maerki <...@jeremias-maerki.ch>
Subject Re: Extract vectors
Date Wed, 11 Feb 2009 09:29:05 GMT
On 11.02.2009 02:33:19 Graeme Kidd wrote:
> Hi again,
> I am still having problems reading in path data from PDFBox. I have an 
> example working with PDFTron that displays the path data no problem but 
> after inspecting the code it seems it never enters an XObject or Form 
> XObject. (I can give you an example of the code if you want.)
> It simply seems to walk straight into a Path object using an 
> "ElementReader" which provides a way of traversing the Element display list 
> of a page. According to its documentation:
> "The display list representing graphical elements (such as text-runs, paths, 
> images, shadings, forms, etc) is accessed using the intrinsic iterator. 
> ElementReader automatically concatenates page contents spanning multiple 
> streams and provides a mechanism to parse contents of sub-display lists 
> (e.g. forms XObjects and Type3 fonts). "
> Is it possible for Path Objects to not be inside a form XObject?

Yes, they can be in a normal page stream.

> In my brief 
> reading of the PDF Spec it doesn't seem to explicitly say that path data 
> will be found in form XObjects. Just that "a form XObject is an entire 
> content stream to be treated as a single graphics object".

Right, the "content stream" is what can contain path data for painting
vector graphics. Both pages and Form XObjects have a content stream.
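
To make that concrete, here is a rough sketch in plain Java (not PDFBox
code) that scans a content stream for path construction and painting
operators. The class name PathOperatorScan and the sample stream are made up
for illustration, and the whitespace tokenizer is a deliberate
simplification: it ignores strings, comments and inline images, which a real
parser has to handle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of PDFBox.
class PathOperatorScan {
    // Path construction operators defined by the PDF content stream syntax.
    static final Set<String> PATH_OPS = Set.of("m", "l", "c", "v", "y", "h", "re");
    // Path painting operators (stroke, fill, and their variants; "n" ends a path unpainted).
    static final Set<String> PAINT_OPS =
            Set.of("S", "s", "f", "F", "f*", "B", "B*", "b", "b*", "n");

    /** Returns the path-related operators in the order they appear. */
    static List<String> pathOperators(String contentStream) {
        List<String> found = new ArrayList<>();
        // Simplification: split on whitespace instead of running a real PDF tokenizer.
        for (String token : contentStream.trim().split("\\s+")) {
            if (PATH_OPS.contains(token) || PAINT_OPS.contains(token)) {
                found.add(token);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // A triangle-like path drawn directly in a page stream, with no XObject involved.
        String page = "q 1 0 0 1 50 50 cm 0 0 m 100 0 l 100 100 l h S Q";
        System.out.println(pathOperators(page));
    }
}
```

The same scan works unchanged on the content stream of a Form XObject,
since both use the same operator set.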

> If my understanding is correct an XObject is an external object that can be 
> referenced in the content stream so that content can be reused. Then if the 
> image only appears once there will be no reason to create a reference for it.

No reason maybe but it's still possible to do it this way.

> If that is the case how did Adobe know where the vector images were? Do you 
> think they went as far as hit testing the paths to see if the paths were 
> somehow grouped together? As currently all I have is all the vector graphics 
> found on a page in one EPS file, rather than an EPS file for each vector 
> graphic in a page.

Same way PDFBox does: Interpret the page stream. When a form is
referenced, interpret that form. And if you just want to extract the
forms, just start with the Form XObject instead of with the page. That's
what Andreas has been experimenting with (as I'm sure you've read in
this thread). I believe there's only a relatively small step involved in
offering extraction of Form XObjects as vector graphics using PDFBox.
I'm not sure what exactly you're tinkering with, though. If I had more
free time I'd probably try to do this myself.
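
The "interpret the page stream, and when a form is referenced, interpret
that form" idea can be sketched roughly as follows. This is a toy model in
plain Java, not PDFBox code: the class FormWalker, the map of form names to
content streams, and the sample streams are all assumptions for
illustration. In a real PDF the name before the Do operator is resolved
through the /XObject entry of the resource dictionary and may also refer to
an image rather than a form.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy interpreter, not part of PDFBox.
class FormWalker {
    static final Set<String> PATH_OPS = Set.of("m", "l", "c", "v", "y", "h", "re");

    /** Collects "streamName:operator" for each path construction operator reached. */
    static List<String> walk(String streamName, String content, Map<String, String> forms) {
        List<String> out = new ArrayList<>();
        walk(streamName, content, forms, new HashSet<>(), out);
        return out;
    }

    private static void walk(String streamName, String content, Map<String, String> forms,
                             Set<String> active, List<String> out) {
        String previous = null;
        // Simplification: whitespace tokenizer, no strings/comments/inline images.
        for (String token : content.trim().split("\\s+")) {
            if (PATH_OPS.contains(token)) {
                out.add(streamName + ":" + token);
            } else if (token.equals("Do") && previous != null && previous.startsWith("/")) {
                String name = previous.substring(1);
                String form = forms.get(name); // unknown names (e.g. images) are skipped
                // 'active' stops the recursion if forms reference each other in a cycle.
                if (form != null && active.add(name)) {
                    walk(name, form, forms, active, out);
                    active.remove(name);
                }
            }
            previous = token;
        }
    }

    public static void main(String[] args) {
        Map<String, String> forms = Map.of("Fm1", "10 10 80 40 re f");
        String page = "0 0 m 100 0 l S q /Fm1 Do Q";
        // Start at the page: paths from the page and from the referenced form.
        System.out.println(walk("page", page, forms));
        // Start at the form itself: only that form's paths.
        System.out.println(walk("Fm1", forms.get("Fm1"), forms));
    }
}
```

Starting the walk at a form instead of the page is what yields one set of
paths per form, rather than everything on the page mixed together.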

BTW, can we assume that you really have Form XObjects in your PDFs? That
question is still open.

Jeremias Maerki
