pdfbox-users mailing list archives

From Peter Murray-Rust <pm...@cam.ac.uk>
Subject Re: Extracting vector graphics from PDF
Date Thu, 26 Apr 2012 13:20:01 GMT
On Mon, Apr 2, 2012 at 2:58 PM, Peter Murray-Rust <pm286@cam.ac.uk> wrote:

> On Mon, Apr 2, 2012 at 2:51 PM, Andrey Kuznetsov <imagero@gmx.de> wrote:
>> Peter, you have to pass your own Graphics2D object (with some overridden
>> methods) to pdfbox.
I am making good progress in capturing graphics primitives by using
Apache Batik. I have managed to intercept the Graphics2D calls by creating a
Batik SVGGraphics2D:

        org.w3c.dom.DOMImplementation domImpl =

        String svgNS = "http://www.w3.org/2000/svg";
        org.w3c.dom.Document document = domImpl.createDocument(svgNS, "svg", null);
        SVGGraphics2D svgGraphics2D = new

I then pass this into PDFReader and use

            drawer.drawPage( svgGraphics2D, page, drawDimension );
            Writer svgwriter = new StringWriter();
            svgGraphics2D.stream(svgwriter, useCSS);

and then analyse the SVG DOM in svgwriter.toString().
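
As a rough illustration of that analysis step, here is a minimal sketch that parses the serialized SVG string with the JDK's own XML parser and lists what each <text> element contains, alongside a count of any element type (for example <path>, which is where outline glyphs would end up). The class SvgTextScan and its method names are hypothetical, not part of PDFBox or Batik, and the DTD-loading feature flag assumes the JDK's default (Xerces-based) parser:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for inspecting SVG output; not part of PDFBox or Batik.
class SvgTextScan {
    private static final String SVG_NS = "http://www.w3.org/2000/svg";

    // Parse the SVG string into a namespace-aware DOM.
    static Document parse(String svg) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Assumption: JDK default parser. Avoids fetching an external DTD
        // if the serialized SVG carries a DOCTYPE declaration.
        factory.setFeature(
            "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        return factory.newDocumentBuilder()
                      .parse(new InputSource(new StringReader(svg)));
    }

    // Return the character content of every <text> element, in document order.
    static List<String> textRuns(String svg) throws Exception {
        NodeList nodes = parse(svg).getElementsByTagNameNS(SVG_NS, "text");
        List<String> runs = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            runs.add(nodes.item(i).getTextContent());
        }
        return runs;
    }

    // Count SVG elements with the given local name, e.g. "path" or "text".
    static int countElements(String svg, String localName) throws Exception {
        return parse(svg).getElementsByTagNameNS(SVG_NS, localName).getLength();
    }
}
```

Calling SvgTextScan.textRuns(svgwriter.toString()) would then give the captured strings, and comparing countElements(..., "text") with countElements(..., "path") gives a quick indication of whether glyphs came through as characters or as outlines.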

This works, but with problems.

The first implementation created outline fonts in Batik (i.e. closed
polylines for glyphs). I then tried to clean up the code, and it now
creates <text> objects with characters, but without a font and without a

Do you have suggestions as to how I can best capture the text info
reliably? I don't mind dealing with outline fonts, as I have to do that for
user-created graphics anyway and have a good store of fonts. But I'd like
to know what switches need to be set. And I have a worry that I am failing
to load fonts somewhere.

Any help much appreciated, but thanks anyway for progress so far.

Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
