pdfbox-users mailing list archives

From Peter Murray-Rust <pm...@cam.ac.uk>
Subject Re: Re: Looking for some guidance on using PDFBox to analyze page content
Date Mon, 23 Mar 2015 19:19:30 GMT
[Last message on this thread to PDFBox users - in case there may be others
interested in the current state].

Thanks,

I wrote this about 3-4 years ago. I work in chunks - at present we've
developed an Open pipeline to read the complete scientific literature (
contentmine.org) and have parked the PDF2SVG and SVG2XML facility for a
bit. The system is sufficiently complex that we've developed a commandline
structure and declarative approach which will be useful for PDF2SVG and -
if no one else does - I intend to revisit this summer.

Part of the problem is that we create intermediate SVG files and these
began to overwhelm us. We've redesigned the system so that files will be
held in directories and can be transported easily.

The next features will include:
 * commandline structure
 * directory structure
 * interpretation of path-based characters (both analytic and lookup)
 * interpretation of pixel based characters (analytic and lookup - quite a
lot achieved but more to go)

Originally I planned to do these by heuristics but we are optimistic of
generating a community who can develop templates for common layouts or
glyphs.

Of course anyone is welcome to fork and make pull-requests :-)



On Mon, Mar 23, 2015 at 5:55 PM, Warren Gallagher <
warren.gallagher@apxconsult.com> wrote:

>
>
> Peter,
>
> Thanks for your information and suggestions. I have been messing about
> with your code to convert PDF to SVG and am quite impressed with what it
> does so far. Is there an API option or command-line switch to cause
> image elements to generate their xlink:href attributes with data: URLs
> (the whole base64 encoded thing)?
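For readers with the same question: the thread does not say whether pdf2svg has such a switch, so the following is only a minimal Java sketch of what an inline data: URL for an SVG image element looks like. The class and method names (DataUriSketch, toDataUri, svgImageElement) are hypothetical and are not part of pdf2svg or PDFBox; only java.util.Base64 from the JDK is used, and the image bytes are assumed to be already encoded as PNG.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper; not part of pdf2svg or PDFBox. */
final class DataUriSketch {

    /** Wraps already-encoded image bytes in a base64 data: URL. */
    static String toDataUri(String mimeType, byte[] imageBytes) {
        // The basic encoder emits no line breaks, so the result is attribute-safe.
        String base64 = Base64.getEncoder().encodeToString(imageBytes);
        return "data:" + mimeType + ";base64," + base64;
    }

    /** Builds a minimal SVG image element that embeds PNG bytes inline. */
    static String svgImageElement(byte[] pngBytes, int width, int height) {
        return "<image width=\"" + width + "\" height=\"" + height
                + "\" xlink:href=\"" + toDataUri("image/png", pngBytes) + "\"/>";
    }

    public static void main(String[] args) {
        // Placeholder bytes only; a real caller would pass actual PNG data.
        byte[] placeholder = "not a real PNG".getBytes(StandardCharsets.US_ASCII);
        System.out.println(svgImageElement(placeholder, 10, 10));
    }
}
```

The enclosing svg element would still need to declare the xlink namespace for the attribute to resolve.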
>
> Regards,
>
> Warren
>
> -------- Original Message --------
>
> Subject: Re: Looking for some guidance on using PDFBox to analyze
>          page content
> Date: 2015-03-20 10:08
> From: Peter Murray-Rust <pm286@cam.ac.uk>
> To: "users@pdfbox.apache.org" <users@pdfbox.apache.org>
> Reply-To:
>
> We do a great deal of this and have created two downstream packages
> which consume the output of PDFBox:
>
> * https://bitbucket.org/petermr/pdf2svg/ [2] (which translates the PDF
>   into SVG)
> * https://bitbucket.org/petermr/svg2xml [3] (which tries to convert the
>   SVG into high-level constructs)
>
> There are roughly 3 outputs from PDFBox that relate to the viewable page
> (we deliberately ignore all metadata, dictionaries, etc. as it is likely
> to be inconsistent):
> * characters, either through codepoints (often not Unicode,
>   unfortunately) or through pixel-based glyphs
> * bitmaps (raster) as Eliot mentions
> * graphics paths (move, line, quadratic and cubic bezier).
>
> It is possible for all of these to occur in the same area. However, in
> many instances the "text" and the "graphics" are separated by whitespace.
> (We cannot rely on the order of primitives.) We can then use whitespace
> heuristics to separate this into "text", "graphics" and "pixel images".
> (Note, however, that text could contain small pixel images for
> characters, and also small paths for underlines, etc.)
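The whitespace heuristic described above can be sketched roughly as follows. This is not the svg2xml implementation, only an illustration under stated assumptions: each primitive has already been reduced to an axis-aligned bounding box, and two boxes belong to the same region when the whitespace between them is narrower than a chosen gap on both axes. The names WhitespaceClusterSketch, Box, isNear and cluster are hypothetical, and the gap value is arbitrary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not taken from pdf2svg or svg2xml. */
final class WhitespaceClusterSketch {

    /** Axis-aligned bounding box of one primitive (glyph, path or image). */
    static final class Box {
        final double x0, y0, x1, y1;

        Box(double x0, double y0, double x1, double y1) {
            this.x0 = x0; this.y0 = y0; this.x1 = x1; this.y1 = y1;
        }

        /** True if the whitespace between the boxes is narrower than gap on both axes. */
        boolean isNear(Box o, double gap) {
            return x0 - gap < o.x1 && o.x0 - gap < x1
                && y0 - gap < o.y1 && o.y0 - gap < y1;
        }
    }

    /** Groups boxes into regions separated by at least gap of whitespace (union-find). */
    static List<List<Box>> cluster(List<Box> boxes, double gap) {
        int n = boxes.size();
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;
        // Order of primitives is not relied on: every pair is compared.
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (boxes.get(i).isNear(boxes.get(j), gap)) {
                    parent[find(parent, i)] = find(parent, j);
                }
            }
        }
        Map<Integer, List<Box>> groups = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            groups.computeIfAbsent(find(parent, i), k -> new ArrayList<>()).add(boxes.get(i));
        }
        return new ArrayList<>(groups.values());
    }

    private static int find(int[] parent, int i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]]; // path halving
            i = parent[i];
        }
        return i;
    }

    public static void main(String[] args) {
        List<Box> boxes = Arrays.asList(
                new Box(0, 0, 10, 10), new Box(12, 0, 20, 10), new Box(100, 100, 110, 110));
        System.out.println(cluster(boxes, 5.0).size() + " regions");
    }
}
```

Deciding whether a region is "text", "graphics" or a "pixel image" would still need a second pass over the primitive types in each region, since, as noted, text regions can contain small images and paths.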
>
> Assuming that you have "clean" graphics - such as plots - it is possible
> with a great deal of work to extract a reasonable guess at the original
> primitives. (For example, there is no "circle" or "rectangle" in PDF,
> only paths.)
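As one small illustration of guessing a primitive back from a path, the sketch below checks whether a closed polyline (a move followed by line segments) is an axis-aligned rectangle. It is an assumption-laden example, not the svg2xml method: the class name RectangleFromPathSketch, the method asRectangle and the tolerance EPS are hypothetical, curves and rotated rectangles are ignored, and the path is assumed to arrive as a plain array of x,y points.

```java
/** Hypothetical sketch; not taken from pdf2svg or svg2xml. */
final class RectangleFromPathSketch {

    /** Assumed coordinate tolerance; real data would need a tuned value. */
    static final double EPS = 1e-6;

    private static boolean eq(double a, double b) {
        return Math.abs(a - b) < EPS;
    }

    /**
     * Returns {x, y, width, height} if the points trace an axis-aligned
     * rectangle, or null otherwise. A fifth point repeating the first
     * (an explicit close) is accepted.
     */
    static double[] asRectangle(double[][] pts) {
        int n = pts.length;
        if (n == 5 && eq(pts[0][0], pts[4][0]) && eq(pts[0][1], pts[4][1])) {
            n = 4; // drop the explicit closing point
        }
        if (n != 4) return null;

        boolean[] horizontal = new boolean[4];
        for (int i = 0; i < 4; i++) {
            double[] a = pts[i];
            double[] b = pts[(i + 1) % 4];
            boolean h = eq(a[1], b[1]) && !eq(a[0], b[0]);
            boolean v = eq(a[0], b[0]) && !eq(a[1], b[1]);
            if (!h && !v) return null; // slanted or zero-length edge
            horizontal[i] = h;
        }
        // Edges must alternate between horizontal and vertical.
        for (int i = 0; i < 4; i++) {
            if (horizontal[i] == horizontal[(i + 1) % 4]) return null;
        }

        double minX = Math.min(pts[0][0], pts[2][0]);
        double minY = Math.min(pts[0][1], pts[2][1]);
        double width = Math.abs(pts[0][0] - pts[2][0]);
        double height = Math.abs(pts[0][1] - pts[2][1]);
        return new double[] {minX, minY, width, height};
    }

    public static void main(String[] args) {
        double[][] path = {{0, 0}, {4, 0}, {4, 2}, {0, 2}, {0, 0}};
        System.out.println(java.util.Arrays.toString(asRectangle(path)));
    }
}
```

Circles are harder, since they usually arrive as several cubic Bezier segments and need a fit rather than an exact test.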
>
> It depends on what your material is, how it was produced, what the
> primitives are, etc. You are very welcome to try our software, which is
> all Apache2 licensed.
>
> On Fri, Mar 20, 2015 at 1:43 PM, Warren Gallagher <
> warren.gallagher@apxconsult.com> wrote:
>
> > Greetings,
> >
> > Is there a means to determine if a page contains:
> > * vector graphics
> > * raster graphics (and what format)
> >
> > Regards,
> > WARREN GALLAGHER - CTO
> > warren.gallagher@apxconsult.com
> > M: 613-791-4987 W: 613-262-2601
> > Advance Property eXposure Canada Inc.
> > 1755 Woodward Drive, Suite 101, Ottawa, Ontario K2C 0P9
> > APXConsult.com [1]
> >
> > Links:
> > ------
> > [1] http://apxconsult.com
>
> --
>
> -------------------------
>
> WARREN GALLAGHER - CTO
>
> warren.gallagher@apxconsult.com
>
> M: 613-791-4987 W: 613-262-2601 Advance Property eXposure Canada Inc.
> 1755 Woodward Drive, Suite 101, Ottawa, Ontario K2C 0P9 APXConsult.com
> [1]
>
> Links:
> ------
> [1] http://apxconsult.com
> [2] https://bitbucket.org/petermr/pdf2svg/
> [3] https://bitbucket.org/petermr/svg2xml
>



-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
