pdfbox-users mailing list archives

From Peter Murray-Rust <pm...@cam.ac.uk>
Subject Re: Extracting vector graphics from PDF
Date Tue, 08 May 2012 07:43:27 GMT
Thanks Andrey,

Here are the SVG files from the two versions (I apologize for the verbosity,
but people may want to inspect the paths):

1.7.0 paths only (glyphs)
<svg fill-opacity="1" xmlns:xlink="http://www.w3.org/1999/xlink"
color-rendering="auto" color-interpolation="auto" stroke="black"
text-rendering="auto" stroke-linecap="square" stroke-miterlimit="10"
stroke-opacity="1" shape-rendering="auto" fill="black"
stroke-dasharray="none" font-weight="normal" stroke-width="1"
xmlns="http://www.w3.org/2000/svg" font-family="&apos;Dialog&apos;"
font-style="normal" stroke-linejoin="miter" font-size="12"
stroke-dashoffset="0" image-rendering="auto">
  <!--Generated by the Batik Graphics2D SVG Generator-->
  <defs id="genericDefs" />
  <g>
    <defs id="defs1">
      <clipPath clipPathUnits="userSpaceOnUse" id="clipPath1">
        <path d="M0 0 L60.9419 0 L60.9419 81.2217 L0 81.2217 L0 0 Z" />
      </clipPath>
      <clipPath clipPathUnits="userSpaceOnUse" id="clipPath2">
        <path d="M0 0 L57.9843 0 L57.9843 77.2799 L0 77.2799 L0 0 Z" />
      </clipPath>
      <clipPath clipPathUnits="userSpaceOnUse" id="clipPath3">
        <path d="M0 0 L64.9174 0 L64.9174 86.5201 L0 86.5201 L0 0 Z" />
      </clipPath>
      <clipPath clipPathUnits="userSpaceOnUse" id="clipPath4">
        <path d="M0 0 L81.2564 0 L81.2564 121.8482 L0 121.8482 L0 0 Z" />
      </clipPath>
      <clipPath clipPathUnits="userSpaceOnUse" id="clipPath5">
        <path d="M0 0 L74.6531 0 L74.6531 99.4956 L0 99.4956 L0 0 Z" />
      </clipPath>
    </defs>
    <g text-rendering="optimizeLegibility"
transform="matrix(9.7634,0,0,9.7634,0,0)">
      <path d="M6.2598 9.3956 L6.2598 9.6299 C6.2598 9.7549 6.2598 9.8956
6.2598 9.9424 L6.3379 9.9737 L6.3379 9.9893 C6.2911 9.9893 6.2598 9.9893
6.2129 9.9893 C6.1661 9.9893 6.1348 9.9893 6.0879 9.9893 L6.0879 9.9737
L6.1661 9.9424 C6.1661 9.8643 6.1661 9.8018 6.1661 9.7393 L6.1661 9.7393
C6.1348 9.7706 6.0879 9.7862 6.0567 9.7862 C5.9473 9.7862 5.8379 9.6924
5.8379 9.5518 C5.8379 9.4268 5.9629 9.3174 6.0879 9.3174 C6.1192 9.3174
6.1504 9.3331 6.1817 9.3643 C6.1973 9.3331 6.2598 9.3174 6.2911 9.3174
L6.2911 9.3174 C6.2754 9.3487 6.2598 9.3799 6.2598 9.3956 ZM6.1661 9.7081
L6.1661 9.3956 C6.1661 9.3799 6.1348 9.3643 6.1036 9.3643 C6.0098 9.3643
5.9317 9.4424 5.9317 9.5518 C5.9317 9.6612 6.0098 9.7237 6.0879 9.7237
C6.1192 9.7237 6.1504 9.7237 6.1661 9.7081 Z" clip-path="url(#clipPath1)"
stroke="none" />
      <path d="M6.455 9.3799 L6.4081 9.3643 L6.4081 9.3487 C6.4394 9.3331
6.5175 9.3331 6.5488 9.3331 C6.5488 9.3643 6.5488 9.4424 6.5488 9.6143
C6.5488 9.6612 6.5488 9.6768 6.5644 9.6924 C6.58 9.7081 6.5956 9.7237
6.6269 9.7237 C6.6738 9.7237 6.7206 9.6924 6.7363 9.6612 L6.7363 9.3956
C6.7363 9.3799 6.7206 9.3799 6.6738 9.3643 L6.6894 9.3331 C6.7206 9.3331
6.7988 9.3174 6.83 9.3331 C6.83 9.3799 6.83 9.4424 6.83 9.6924 C6.8456
9.7237 6.8769 9.7393 6.9081 9.7549 L6.8925 9.7706 C6.8925 9.7706 6.8613
9.7862 6.8456 9.7862 C6.7988 9.7862 6.7519 9.7706 6.7519 9.7081 L6.7363
9.7081 C6.705 9.7549 6.6581 9.7862 6.5956 9.7862 C6.5644 9.7862 6.5175
9.7706 6.5019 9.7393 C6.4706 9.7237 6.455 9.6768 6.455 9.6299 C6.455 9.5518
6.455 9.4581 6.455 9.3799 Z" clip-path="url(#clipPath1)" stroke="none" />

and 1.6.0
<svg fill-opacity="1" xmlns:xlink="http://www.w3.org/1999/xlink"
color-rendering="auto" color-interpolation="auto" stroke="black"
text-rendering="auto" stroke-linecap="square" stroke-miterlimit="10"
stroke-opacity="1" shape-rendering="auto" fill="black"
stroke-dasharray="none" font-weight="normal" stroke-width="1"
xmlns="http://www.w3.org/2000/svg" font-family="&apos;Dialog&apos;"
font-style="normal" stroke-linejoin="miter" font-size="12"
stroke-dashoffset="0" image-rendering="auto">
  <!--Generated by the Batik Graphics2D SVG Generator-->
  <defs id="genericDefs" />
  <g>
    <defs id="defs1">
      <clipPath clipPathUnits="userSpaceOnUse" id="clipPath1">
        <path d="M0 0 L60.9419 0 L60.9419 81.2217 L0 81.2217 L0 0 Z" />
      </clipPath>
      <clipPath clipPathUnits="userSpaceOnUse" id="clipPath2">
        <path d="M0 0 L57.9843 0 L57.9843 77.2799 L0 77.2799 L0 0 Z" />
      </clipPath>
      <clipPath clipPathUnits="userSpaceOnUse" id="clipPath3">
        <path d="M0 0 L64.9174 0 L64.9174 86.5201 L0 86.5201 L0 0 Z" />
      </clipPath>
      <clipPath clipPathUnits="userSpaceOnUse" id="clipPath4">
        <path d="M0 0 L81.2564 0 L81.2564 121.8482 L0 121.8482 L0 0 Z" />
      </clipPath>
      <clipPath clipPathUnits="userSpaceOnUse" id="clipPath5">
        <path d="M0 0 L74.6531 0 L74.6531 99.4956 L0 99.4956 L0 0 Z" />
      </clipPath>
    </defs>
    <g text-rendering="optimizeLegibility" font-size="1"
font-family="&apos;null&apos;" transform="matrix(9.7634,0,0,9.7634,0,0)">
      <text xml:space="preserve" x="5.8067" y="9.7706"
clip-path="url(#clipPath1)" stroke="none">q</text>
      <text xml:space="preserve" x="6.3769" y="9.7706"
clip-path="url(#clipPath1)" stroke="none">u</text>
(this is part of the string "question")

I have verified that the glyphs correspond to "q" and "u". There is a
useful heuristic in that the clipPaths appear to be coupled to the fonts
(including, I think, different font sizes), so the output effectively records
the fonts, their glyphs and their metrics. I assume that, if I knew how, I
could dump the font information (presumably through the COSDictionary).
That would give me most of what I need:
* the character (from 1.6.0)
* the character position (from 1.6.0)
* the glyph (from 1.7.0) giving (i) the coordinate origin, (ii) the width
and height and (iii) an indication of italic (neither 1.7.0 nor 1.6.0
decodes the glyph as italic, so I will have to use heuristics)
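
A minimal sketch of how the width and height of a glyph might be read out of
the 1.7.0 path data and matched against a character position from 1.6.0. It
assumes the path data uses only the absolute M, L, C and Z commands seen in
the Batik output above. The names GlyphOutline, parse and plausiblyBelongsTo,
and the tolerance value, are illustrative and not part of PDFBox or Batik.

```java
import java.awt.geom.Path2D;
import java.awt.geom.Rectangle2D;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GlyphOutline {
    // One command letter or one number. Assumption: only absolute M, L, C, Z
    // occur, as in the Batik output quoted above. Other commands are not handled.
    private static final Pattern TOKEN =
            Pattern.compile("[MLCZ]|-?\\d*\\.?\\d+(?:[eE][-+]?\\d+)?");

    /** Builds a Path2D from an SVG "d" attribute restricted to M, L, C, Z. */
    static Path2D.Double parse(String d) {
        Path2D.Double path = new Path2D.Double();
        Matcher m = TOKEN.matcher(d);
        char cmd = 'Z';               // no drawing command seen yet
        double[] a = new double[6];
        int n = 0;
        while (m.find()) {
            String t = m.group();
            char c = t.charAt(0);
            if (Character.isLetter(c)) {
                cmd = c;
                n = 0;
                if (c == 'Z') path.closePath();
                continue;
            }
            if (cmd == 'Z') {
                throw new IllegalArgumentException("number without command: " + t);
            }
            a[n++] = Double.parseDouble(t);
            if (cmd == 'M' && n == 2) {
                path.moveTo(a[0], a[1]);
                cmd = 'L';            // further pairs after M are implicit line-tos
                n = 0;
            } else if (cmd == 'L' && n == 2) {
                path.lineTo(a[0], a[1]);
                n = 0;
            } else if (cmd == 'C' && n == 6) {
                path.curveTo(a[0], a[1], a[2], a[3], a[4], a[5]);
                n = 0;
            }
        }
        return path;
    }

    /**
     * Hypothetical matching rule: the text origin lies within tol of the
     * glyph's left edge, and the baseline falls inside the glyph's vertical
     * extent (widened by tol).
     */
    static boolean plausiblyBelongsTo(Rectangle2D glyph, double textX,
                                      double baselineY, double tol) {
        return Math.abs(glyph.getMinX() - textX) <= tol
                && baselineY >= glyph.getMinY() - tol
                && baselineY <= glyph.getMaxY() + tol;
    }

    public static void main(String[] args) {
        // A rectangle standing in for a glyph outline; real input would be
        // the full d attribute of one <path> element.
        Rectangle2D box = parse("M1 2 L4 2 L4 6 L1 6 Z").getBounds2D();
        System.out.println(box.getWidth() + " x " + box.getHeight());
    }
}
```

For outlines containing curves, getBounds2D() may include the Bezier control
points rather than the exact curve extent, depending on the Java version, so
the box should be treated as approximate.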

This is very tedious, but at least it's possible. However I would suggest
to the PDFBox developers that they preserve the character info when
transmitting to the drawing surface Graphics2D. This would allow different
fonts, even if not as beautiful.

On Tue, May 8, 2012 at 5:44 AM, Andrey Kuznetsov <imagero@gmx.de> wrote:

> Hi Peter,
>
> >When I use 1.7.0 NO text is written. Instead the characters are replaced
> >by outline glyphs using <svg:path>. The visual layout is effectively the
> >same as the input PDF but there are no explicit characters.
>
> Wow! They managed to implement it like Adobe suggested!
>
Good. We understand each other I think!


> >I guess that in 1.7.0 NO characters are transmitted to drawString and
> >that everything is drawShape(), with the precomputed glyphs.
>
> Yes, you have to trace from where g.draw(Shape) / g.fill(Shape) is coming.
>
> Not easy task however, since paths are also drawn with same methods.
>

That's my problem. If I could get in and add simple attributes to carry text
info through to the surface, that would be great. As it is, it is very
difficult (though not impossible) to hack the glyph stream. We have to assume
that text mainly occurs in rows of glyphs with the same y-coordinate, but this
varies slightly because of the different glyph origins.
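
A minimal sketch of that row assumption: glyphs are sorted by a reference
y-coordinate and put in the same row while they stay within a tolerance of
the row's first glyph. The class GlyphRows, the Glyph holder and the
tolerance are illustrative. Which y to use per glyph (for example the first
moveTo of its path) is also an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class GlyphRows {
    /** Hypothetical holder for one glyph's reference point. */
    static final class Glyph {
        final double x;
        final double y;

        Glyph(double x, double y) {
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups glyphs into rows. A glyph joins the current row if its y is at
     * most `tolerance` above the y of the row's first (smallest-y) glyph.
     * Otherwise it starts a new row. Each row is then sorted left to right.
     */
    static List<List<Glyph>> groupIntoRows(List<Glyph> glyphs, double tolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y));

        List<List<Glyph>> rows = new ArrayList<>();
        List<Glyph> current = null;
        double rowY = 0;
        for (Glyph g : sorted) {
            if (current == null || g.y - rowY > tolerance) {
                current = new ArrayList<>();
                rows.add(current);
                rowY = g.y;           // the row is anchored at its first glyph
            }
            current.add(g);
        }
        for (List<Glyph> row : rows) {
            row.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph(6.455, 9.3799));   // first moveTo of the "u" path
        glyphs.add(new Glyph(6.2598, 9.3956));  // first moveTo of the "q" path
        List<List<Glyph>> rows = groupIntoRows(glyphs, 0.1);
        System.out.println(rows.size() + " row(s)");
    }
}
```

Anchoring each row at its first glyph keeps the grouping simple, but a
tolerance that is too large would merge adjacent lines of small text, so the
value would need tuning against the font sizes in the document.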

> May be I’ll download new version and look deeper…
>

I, for one, would be grateful if you did! I thought I was
miscompiling/omitting some resource, etc. which caused different output. It
took a while to realise it was the version.

Having used (by mistake) a 0.7.0 version I have seen the marked progress
and want to thank and congratulate the PDFBox community for the current
position and momentum.

[Why don't I ask the authors for XML and SVG directly? That's a different
and political issue. If anyone is interested in helping liberate the
scientific literature legally then hacking PDFs is a major strategy.
Volunteers welcome!]

> P.

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
