pdfbox-users mailing list archives

From Peter Murray-Rust <pm...@cam.ac.uk>
Subject Re: How to ensure a PDF is valid
Date Tue, 22 Jan 2013 10:51:06 GMT
On Tue, Jan 22, 2013 at 6:54 AM, Guillaume Bailleul
<gbm.bailleul@gmail.com> wrote:

> Hi All,
>
> Is there a way to validate a pdf with PDFBox? I mean to ensure that
> the document complies with the PDF Reference.
>
> My idea was to load the document and then ensure each xobject is
> parsed (retrieving each one). Is it a good way to do it ?
>

I am also very interested in this. In the PDF2SVG project
(https://bitbucket.org/petermr/pdf2svg) we convert non-standard PDFs to
Unicode characters and SVG. If the input were PDF-reference-compliant (e.g.
used the standard 14 fonts and Unicode) our job would be relatively easy.

However, we are working with STM (Scientific, Technical, Medical)
publications, which seem to be very non-compliant. Sadly, the worst
compliance comes in the mathematical and symbol components. Many fonts are
proprietary, so we have developed heuristics, by manual inspection, which
map their characters to Unicode. Other fonts derive from (say)
Mathematical-PI, which uses proprietary codes (e.g. H11001 for "plus") for
which there is no published mapping. (There is a great tool,
shapecatcher.com, which allows you to look up many Unicode characters from
the glyph.)
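
To make the idea of a hand-built mapping concrete, here is a minimal sketch
in Java of a lookup table from proprietary glyph codes to Unicode code
points. It is only an illustration: the class GlyphCodeMap and its methods
are hypothetical names, not code from PDF2SVG or PDFBox, and the only entry
is the H11001 "plus" example mentioned above (assumed here to mean U+002B
PLUS SIGN).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;

// Hypothetical helper for illustration only; not part of PDF2SVG or PDFBox.
final class GlyphCodeMap {
    // Proprietary glyph code (e.g. "H11001") -> Unicode code point.
    private final Map<String, Integer> codeToUnicode = new HashMap<>();

    // Record one manually discovered mapping.
    void register(String glyphCode, int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + codePoint);
        }
        codeToUnicode.put(glyphCode, codePoint);
    }

    // Returns the code point, or empty when the code has not been mapped yet.
    OptionalInt lookup(String glyphCode) {
        Integer codePoint = codeToUnicode.get(glyphCode);
        return codePoint == null ? OptionalInt.empty() : OptionalInt.of(codePoint);
    }

    // Seeds the table with the single example given in this message.
    static GlyphCodeMap withKnownEntries() {
        GlyphCodeMap map = new GlyphCodeMap();
        map.register("H11001", 0x002B); // "plus", assumed to be U+002B PLUS SIGN
        return map;
    }
}
```

Calling GlyphCodeMap.withKnownEntries().lookup("H11001") yields the code
point for "+", while an unmapped code comes back empty, so characters that
still need manual inspection can be flagged rather than guessed.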

In many cases it may be possible to re-emit compliant PDF (although my
current primary interest is to determine the Unicode point and do semantic
processing). It should therefore be possible to create a PDFTidy tool which
removes the non-compliance (cf. HTMLTidy). Maybe a little of the kerning
would be lost, but I think standards are a good idea!

Question. Is Symbol *necessary* in the PDF spec or can equivalent
functionality be found in Unicode codepoints?

I have currently hacked about 55 fonts - normally only the characters I
discover in the wild, see
https://bitbucket.org/petermr/pdf2svg/src/905f2fa94bcf/src/main/resources/org/xmlcml/pdf2svg/fontFamilySets/nonStandardFontFamilySet.xml?at=default and
https://bitbucket.org/petermr/pdf2svg/src/905f2fa94bcf5e8d3ea17eabd7bc94b53bd02ae8/src/main/resources/org/xmlcml/pdf2svg/codepoints?at=default.
Any insight or contributions here would be very valuable - and please feel
free to fork and develop it.

FWIW the next phase (SVGPlus) uses heuristics to recreate paragraphs and other
objects (super/subscripts, maths equations, tables, semantic graphs). The
third phase turns these into semantic chemistry, biology, etc. - all from
the PDF.

 P.


-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
