pdfbox-users mailing list archives

From Warren Gallagher <warren.gallag...@apxconsult.com>
Subject Fwd: Re: Looking for some guidance on using PDFBox to analyze page content
Date Mon, 23 Mar 2015 17:55:27 GMT
 

Peter, 

Thanks for your information and suggestions. I have been messing about
with your code to convert PDF to SVG and am quite impressed with what it
does so far. Is there an API option or command-line switch to cause
image elements to generate their xlink:href attributes with data: URLs
(the whole base64 encoded thing)? 

Regards, 

Warren 
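
For reference, the data: URL form asked about above can be sketched in
plain Java. This is only an illustration of the encoding itself:
DataUrls, toDataUrl and svgImage are hypothetical names, not part of
pdf2svg or PDFBox, and the MIME type is assumed to be known by the
caller.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper; not part of pdf2svg or PDFBox. */
final class DataUrls {
    private DataUrls() {}

    /** Builds a data: URL of the form data:<mime>;base64,<payload>. */
    static String toDataUrl(String mimeType, byte[] bytes) {
        return "data:" + mimeType + ";base64,"
                + Base64.getEncoder().encodeToString(bytes);
    }

    /** Wraps the data: URL in a minimal SVG image element. */
    static String svgImage(String mimeType, byte[] bytes, int width, int height) {
        return "<image width=\"" + width + "\" height=\"" + height
                + "\" xlink:href=\"" + toDataUrl(mimeType, bytes) + "\"/>";
    }

    public static void main(String[] args) {
        // Placeholder bytes only; a real caller would pass encoded image data.
        byte[] demo = "hello".getBytes(StandardCharsets.US_ASCII);
        System.out.println(svgImage("image/png", demo, 10, 10));
    }
}
```

The resulting element is self-contained, at the cost of roughly a third
more bytes than the raw image because of the base64 encoding.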

-------- Original Message -------- 

SUBJECT: Re: Looking for some guidance on using PDFBox to analyze page content
DATE: 2015-03-20 10:08
FROM: Peter Murray-Rust <pm286@cam.ac.uk>
TO: "users@pdfbox.apache.org" <users@pdfbox.apache.org>

We do a great deal of this and have created two downstream packages
which consume the output of PDFBox:

* https://bitbucket.org/petermr/pdf2svg/ [2] (which translates the PDF
  into SVG)
* https://bitbucket.org/petermr/svg2xml [3] (which tries to convert the
  SVG into high-level constructs)

There are roughly three outputs from PDFBox that relate to the viewable
page (we deliberately ignore all metadata, dictionaries, etc., as they
are likely to be inconsistent):

* characters, either through codepoints (often not Unicode,
  unfortunately) or through pixel-based glyphs
* bitmaps (raster), as Eliot mentions
* graphics paths (move, line, quadratic and cubic bezier).
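
As a rough illustration of the three categories above, the sketch below
sorts PDF content stream operator names into text, vector and image
buckets. It assumes the operator names have already been obtained as
plain strings by whatever means you prefer; PageContentKinds and
classify are hypothetical names, not PDFBox or pdf2svg API. Note that
"Do" paints any XObject, which may be a form rather than a raster image,
so that bucket is only a hint.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch; not a PDFBox or pdf2svg class. */
final class PageContentKinds {
    enum Kind { TEXT, VECTOR, XOBJECT_OR_INLINE_IMAGE }

    // Text-showing operators from the PDF specification.
    private static final Set<String> TEXT_OPS = Set.of("Tj", "TJ", "'", "\"");
    // Path construction operators (move, line, curves, close, rectangle).
    private static final Set<String> PATH_OPS = Set.of("m", "l", "c", "v", "y", "h", "re");
    // "Do" paints an XObject (image or form); "BI" starts an inline image.
    private static final Set<String> IMAGE_OPS = Set.of("Do", "BI");

    private PageContentKinds() {}

    /** Returns the kinds of content suggested by the given operator names. */
    static EnumSet<Kind> classify(List<String> operatorNames) {
        EnumSet<Kind> found = EnumSet.noneOf(Kind.class);
        for (String op : operatorNames) {
            if (TEXT_OPS.contains(op)) {
                found.add(Kind.TEXT);
            } else if (PATH_OPS.contains(op)) {
                found.add(Kind.VECTOR);
            } else if (IMAGE_OPS.contains(op)) {
                found.add(Kind.XOBJECT_OR_INLINE_IMAGE);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(classify(List.of("BT", "Tj", "ET", "re", "f", "Do")));
    }
}
```

A fuller check would resolve each "Do" name against the page resources
to see whether it refers to an image, and would also descend into form
XObjects.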

It is possible for all of these to occur in the same area. However, in
many instances the "text" and the "graphics" are separated by
whitespace. (We cannot rely on the order of primitives.) We can then
use whitespace heuristics to separate the page into "text", "graphics"
and "pixel images". (Note, however, that text could contain small pixel
images for characters, and also small paths for underlines, etc.)
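
A minimal sketch of one possible whitespace heuristic follows. It is not
the svg2xml implementation: it simply merges bounding boxes that lie
closer together than a chosen gap, so that regions separated by wider
whitespace end up as separate clusters. WhitespaceClusters, cluster and
inflate are hypothetical names, and the gap value is an assumption the
caller has to tune.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch; not taken from svg2xml. */
final class WhitespaceClusters {
    private WhitespaceClusters() {}

    /** Grows a rectangle by gap on every side. */
    static Rectangle2D inflate(Rectangle2D r, double gap) {
        return new Rectangle2D.Double(r.getX() - gap, r.getY() - gap,
                r.getWidth() + 2 * gap, r.getHeight() + 2 * gap);
    }

    /**
     * Merges boxes whose bounds come within gap of an existing cluster.
     * Returns the bounding box of each resulting cluster.
     * Boxes with zero width or height are never merged by intersects().
     */
    static List<Rectangle2D> cluster(List<Rectangle2D> boxes, double gap) {
        List<Rectangle2D> clusters = new ArrayList<>();
        for (Rectangle2D box : boxes) {
            Rectangle2D merged = (Rectangle2D) box.clone();
            boolean absorbed;
            do {
                // Repeat, because a grown cluster may now reach others.
                absorbed = false;
                Iterator<Rectangle2D> it = clusters.iterator();
                while (it.hasNext()) {
                    Rectangle2D c = it.next();
                    if (inflate(merged, gap).intersects(c)) {
                        merged = merged.createUnion(c);
                        it.remove();
                        absorbed = true;
                    }
                }
            } while (absorbed);
            clusters.add(merged);
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<Rectangle2D> boxes = List.of(
                new Rectangle2D.Double(0, 0, 10, 10),
                new Rectangle2D.Double(12, 0, 10, 10),
                new Rectangle2D.Double(100, 0, 10, 10));
        System.out.println(cluster(boxes, 5).size() + " clusters");
    }
}
```

Merging by bounding box is deliberately crude: an L-shaped group can
swallow a neighbour that sits in its empty corner, so real pages
usually need a finer test.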

Assuming that you have "clean" graphics, such as plots, it is possible
with a great deal of work to extract a reasonable guess at the original
primitives. (For example, there is no "circle" or "rectangle" in PDF,
only paths.)
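
To give a flavour of that guessing, here is a small sketch that tests
whether a straight-line path looks like an axis-aligned rectangle. It
assumes the path has already been reduced to its vertices as {x, y}
pairs; ShapeGuess and isAxisAlignedRectangle are hypothetical names, not
svg2xml API, and the tolerance eps is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not taken from svg2xml. */
final class ShapeGuess {
    private ShapeGuess() {}

    /**
     * True if the vertices form a closed path of four edges that
     * alternate between horizontal and vertical, within tolerance eps.
     * Each point is a {x, y} pair; a fifth point repeating the first
     * is accepted as an explicit close.
     */
    static boolean isAxisAlignedRectangle(List<double[]> points, double eps) {
        List<double[]> p = new ArrayList<>(points);
        if (p.size() == 5
                && Math.abs(p.get(0)[0] - p.get(4)[0]) <= eps
                && Math.abs(p.get(0)[1] - p.get(4)[1]) <= eps) {
            p.remove(4); // drop the repeated start point
        }
        if (p.size() != 4) {
            return false;
        }
        boolean prevHorizontal = false;
        for (int i = 0; i < 4; i++) {
            double[] a = p.get(i);
            double[] b = p.get((i + 1) % 4);
            double dx = Math.abs(a[0] - b[0]);
            double dy = Math.abs(a[1] - b[1]);
            boolean horizontal = dy <= eps && dx > eps;
            boolean vertical = dx <= eps && dy > eps;
            if (!horizontal && !vertical) {
                return false; // slanted or zero-length edge
            }
            if (i > 0 && horizontal == prevHorizontal) {
                return false; // two edges in a row with the same direction
            }
            prevHorizontal = horizontal;
        }
        return true;
    }

    public static void main(String[] args) {
        List<double[]> box = List.of(
                new double[] {0, 0}, new double[] {10, 0},
                new double[] {10, 5}, new double[] {0, 5});
        System.out.println(isAxisAlignedRectangle(box, 0.01));
    }
}
```

Circles are harder, since they typically arrive as several cubic bezier
segments and need a fit rather than an exact test.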

It depends on what your material is, how it was produced, what the
primitives are, etc. You are very welcome to try our software, which is
all Apache 2 licensed.

On Fri, Mar 20, 2015 at 1:43 PM, Warren Gallagher
<warren.gallagher@apxconsult.com> wrote:

> Greetings,
>
> Is there a means to determine if a page contains:
>
> * vector graphics
> * raster graphics (and what format)
>
> Regards,
>
> WARREN GALLAGHER - CTO
> warren.gallagher@apxconsult.com
> M: 613-791-4987 W: 613-262-2601
> Advance Property eXposure Canada Inc.
> 1755 Woodward Drive, Suite 101, Ottawa, Ontario K2C 0P9
> APXConsult.com [1]

-- 

-------------------------

WARREN GALLAGHER - CTO

warren.gallagher@apxconsult.com

M: 613-791-4987 W: 613-262-2601 Advance Property eXposure Canada Inc.
1755 Woodward Drive, Suite 101, Ottawa, Ontario K2C 0P9 APXConsult.com
[1] 

Links:
------
[1] http://apxconsult.com
[2] https://bitbucket.org/petermr/pdf2svg/
[3] https://bitbucket.org/petermr/svg2xml
