pdfbox-users mailing list archives

From Graeme Kidd <coolki...@hotmail.com>
Subject RE: Extract vectors
Date Wed, 11 Feb 2009 10:59:21 GMT

>BTW, can we assume that you really have Form XObjects in your PDFs? That question is still open.
According to the PDFTron code, it never enters a Form XObject; it goes straight into a Path
For example:

while ((element = reader.next()) != null) { // Read page contents
    switch (element.getType()) {
        case Element.e_path:       // Process path data...
            ProcessPath(reader, element);
            System.out.println("In ProcessPath");
            break;
        case Element.e_text_begin: // Process text block...
            break;
        case Element.e_form:       // Process form XObjects
            System.out.println("Process form XObjects Begin");
            System.out.println("Process form XObjects End");
            break;
        case Element.e_image:      // Process Images
            System.out.println("In ProcessImage");
            break;
    }
}

I see a lot of "In ProcessPath" and some "In ProcessImage" text being output, but I don't see
any "Process form XObjects (Begin|End)" being displayed. This is why I am assuming that my
path data is in the page stream and not in a Form XObject.
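
As a rough illustration of the same check without either library, here is a minimal Java sketch that walks a content stream, counts the paths painted directly in it, and follows "Do" references into form streams. Everything in it is an assumption for illustration: the class PathLocator, the method countPaths, the form name /Fm0 and the sample streams are made up, and splitting on whitespace ignores strings, arrays, inline images and comments that a real content stream parser must handle. It also assumes forms do not reference themselves.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy content-stream walker (hypothetical; not PDFBox or PDFTron code). */
class PathLocator {
    // PDF operators that stroke and/or fill the current path.
    private static final Set<String> PAINT_OPS = new HashSet<>(
            Arrays.asList("S", "s", "f", "F", "f*", "B", "B*", "b", "b*"));

    /**
     * Returns {paths painted directly in this stream, paths painted inside
     * referenced forms}. The forms map holds form name -> form content stream.
     */
    static int[] countPaths(String stream, Map<String, String> forms) {
        int direct = 0;
        int nested = 0;
        String previous = "";
        for (String token : stream.trim().split("\\s+")) {
            if (PAINT_OPS.contains(token)) {
                direct++;
            } else if (token.equals("Do") && forms.containsKey(previous)) {
                // A form is referenced: interpret its content stream as well.
                int[] inner = countPaths(forms.get(previous), forms);
                nested += inner[0] + inner[1];
            }
            previous = token;
        }
        return new int[] {direct, nested};
    }

    public static void main(String[] args) {
        Map<String, String> forms = new HashMap<>();
        forms.put("/Fm0", "0 0 10 10 re f"); // made-up form: one filled rectangle

        int[] pageOnly = countPaths("10 10 m 50 50 l S", forms);
        int[] withForm = countPaths("10 10 m 50 50 l S /Fm0 Do", forms);

        System.out.println(pageOnly[0] + " direct, " + pageOnly[1] + " in forms");
        System.out.println(withForm[0] + " direct, " + withForm[1] + " in forms");
    }
}
```

A page whose second number stays at zero would match what I am seeing: all paths are painted straight from the page stream.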

> Date: Wed, 11 Feb 2009 10:29:05 +0100
> From: dev@jeremias-maerki.ch
> To: pdfbox-users@incubator.apache.org
> Subject: Re: Extract vectors
> On 11.02.2009 02:33:19 Graeme Kidd wrote:
>> Hi again,
>> I am still having problems reading in path data from PDFBox. I have an
>> example working with PDFTron that displays the path data no problem but
>> after inspecting the code it seems it never enters an XObject or Form
>> XObject. (I can give you an example of the code if you want.)
>> It simply seems to walk straight into a Path object using an
>> "ElementReader" which provides a way of traversing the Element display list
>> of a page. According to its documentation:
>> "The display list representing graphical elements (such as text-runs, paths,
>> images, shadings, forms, etc) is accessed using the intrinsic iterator.
>> ElementReader automatically concatenates page contents spanning multiple
>> streams and provides a mechanism to parse contents of sub-display lists
>> (e.g. forms XObjects and Type3 fonts). "
>> Is it possible for Path Objects to not be inside a form XObject?
> Yes, they can be in a normal page stream.
>> In my brief
>> reading of the PDF Spec it doesn't seem to explicitly say that path data
>> will be found in form XObjects. Just that "a form XObject is an entire
>> content stream to be treated as a single graphics object".
> Right, the "content stream" is what can contain path data for painting
> vector graphics. Both pages and Form XObjects have a content
> stream.
>> If my understanding is correct, an XObject is an external object that can be
>> referenced in the content stream so that content can be reused. Then if the
>> image only appears once there will be no reason to create a reference for it.
> No reason maybe but it's still possible to do it this way.
>> If that is the case how did Adobe know where the vector images were? Do you
>> think they went as far as hit testing the paths to see if the paths were
>> somehow grouped together? As currently all I have is all the vector graphics
>> found on a page in one EPS file, rather than an EPS file for each vector
>> graphic in a page.
> Same way PDFBox does: Interpret the page stream. When a form is
> referenced, interpret that form. And if you just want to extract the
> forms, just start with the Form XObject instead of with the page. That's
> what Andreas has been experimenting with (as I'm sure you've read in
> this thread). I believe there's only a relatively small step involved to
> offering extraction of form XObjects as vector graphics using PDFBox.
> I'm not sure what exactly you're tinkering with, though. If I had more
> free time I'd probably try to do this myself.
> BTW, can we assume that you really have Form XObjects in your PDFs? That
> question is still open.
> Jeremias Maerki