pdfbox-users mailing list archives

From European Neuroscience Center <mnachev.nscenter...@gmail.com>
Subject Re: Extract embedded SVG image from PDF file
Date Tue, 05 Mar 2019 23:07:03 GMT
The image in the PDF file is already in SVG format, i.e. in XML format. I need
to find it and extract this XML part as a file with an .svg extension.
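
A rough sketch of what "find it and extract this XML part" could look like,
assuming the SVG really is stored as literal XML inside the file, either
uncompressed or in a Flate-compressed stream. It uses only the Python standard
library rather than PDFBox, and the function name find_svg_fragments is made up
for illustration. It will find nothing if the graphic is stored as ordinary PDF
drawing operators, or if the stream uses another filter or is encrypted.

```python
import re
import zlib

# Raw stream bodies: everything between "stream" and "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)
# A literal SVG document or fragment.
SVG_RE = re.compile(rb"<svg\b.*?</svg\s*>", re.DOTALL)


def find_svg_fragments(pdf_bytes):
    """Return each distinct <svg>...</svg> fragment found in the file.

    Looks at the raw bytes (uncompressed content) and at every stream
    that can be inflated with zlib (FlateDecode). Hypothetical helper,
    not part of any PDF library.
    """
    candidates = [pdf_bytes]
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates trailing bytes after the zlib data
            candidates.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            pass  # not Flate-compressed; the raw scan already covers it

    found = []
    for blob in candidates:
        for svg in SVG_RE.findall(blob):
            if svg not in found:
                found.append(svg)
    return found
```

Each returned fragment can then be written to a file with an .svg extension,
for example open("figure.svg", "wb").write(fragment).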


On Wed, Mar 6, 2019 at 12:57 AM Peter Murray-Rust <pm286@cam.ac.uk> wrote:

> I have been doing a lot of graphical extraction of scientific "images",
> but in general there is no algorithmic way. (I'd be happy to see if there
> is an overlap of our interests.)
>
> To simplify: The PDF stream consists of bitmaps (images), glyphs
> (characters with code points) and paths (a mixture of Move, Line, Quadratic
> and Cubic curves, with Close(Z)). I tend to use "image" for bitmaps and
> "plots", "diagrams" or "graphics" for non-bitmap graphics. A "plot"
> generally consists of characters, and paths (and sometimes small
> images/bitmaps). But paths can occur anywhere and a diagram is only defined
> by convention - either a whitespace border or a rectangular path surround.
> But characters can be created by paths (cursive glyphs) which are difficult
> to interpret, and small paths can be embedded within runs of glyphs. I
> convert these to SVG.
>
> In practice I attempt to identify diagrams by whitespace surrounds,
> borders, and formal identification such as "Figure 2." But some diagrams
> don't have captions (e.g. chemical reaction schemes). In other places paths
> are used as page decoration (e.g. thin lines, publisher icons, etc.).
>
> So the simple answer: there is no formal way, but there are heuristics. I am
> making useful progress with this and can extract certain types of diagrams
> into SVG.
>
> See https://github.com/petermr/normami (warning: it's complex and mostly
> created as a library).
>
>
> On Tue, Mar 5, 2019 at 10:34 PM European Neuroscience Center <
> mnachev.nscenter.eu@gmail.com> wrote:
>
> > Hi,
> >
> > What is the way to extract an embedded image that is in SVG format from
> > a PDF file using PDFBox?
> >
> > If there is no such option, how can I determine where the embedded SVG
> > image starts, and extract this XML part of the PDF file?
> >
> >
> > Regards,
> > Miro.
> >
>
>
> --
> Peter Murray-Rust
> Reader Emeritus in Molecular Informatics
> Unilever Centre, Dept. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
>
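
The Move, Line, Cubic and Close segments described in the quoted reply map
fairly directly onto SVG path commands. The sketch below illustrates that
mapping for the PDF operators m, l, c and h only. It is not taken from normami;
the function name and the (operator, operands) input shape are assumptions made
for the example, and it ignores transformation matrices, clipping and styling.

```python
def pdf_path_to_svg_d(segments, page_height):
    """Convert (operator, operands) pairs into an SVG path 'd' string.

    PDF user space has y pointing up and SVG has y pointing down, so
    every y coordinate is flipped against page_height.
    """
    commands = {"m": "M", "l": "L", "c": "C"}
    parts = []
    for op, coords in segments:
        if op == "h":  # close the current subpath
            parts.append("Z")
            continue
        if op not in commands:
            raise ValueError(f"unsupported path operator: {op!r}")
        # Operands are a flat x, y, x, y, ... list; pair them up and flip y.
        points = [
            f"{x:g} {page_height - y:g}"
            for x, y in zip(coords[0::2], coords[1::2])
        ]
        parts.append(commands[op] + " " + " ".join(points))
    return " ".join(parts)
```

The resulting string can be placed in the d attribute of an SVG path element.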
