pdfbox-users mailing list archives

From Peter Murray-Rust <pm...@cam.ac.uk>
Subject Re: problem with pdf eof
Date Fri, 10 Oct 2014 14:05:10 GMT
On Fri, Oct 10, 2014 at 2:33 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
wrote:

> Hi Marc,
>
> text and image extraction is one of the regular use cases. Keeping the
> formatting is also possible but there is a different concept behind the PDF
> format and text processing. E.g. what is a paragraph within a text
> processor might be individually placed characters (glyphs) within a PDF
> file. You might want to look into PDFStreamEngine and its subclasses to
> see how to process graphics and text information of a PDF.
>
> Another sample is PDF2SVG which uses PDFBox [
> https://bitbucket.org/petermr/pdf2svg/wiki/Home]
>

Thanks for the link. See also http://www.contentmine.org

The PDF2SVG project is active and the first part of a pipeline which
includes:

PDF -> (SVG, PNG) -> (SVG, XHTML, PNG) -> (SVG, XHTML, SVG) (where bitmaps
have been converted to SVG) -> (Shapes, Text) -> Semantic Documents ->
Science

We are now able to take (most) PDFs and extract primitives which are
heuristically combined to create Characters and Paths, which are combined
to Shapes and Text. This is structured into XHTML, along with
sub/superscripts and styling (italics). In favourable cases we can extract
semantic science (currently evolutionary trees and chemical reactions, both
recovered from pixel diagrams in PDFs).
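
To make the "heuristically combined" step concrete, here is a minimal sketch
of one such heuristic: turning positioned characters on a single line into
XHTML with sub/superscripts. It is not the PDF2SVG code. The names
ScriptSketch, Glyph and toXhtml, and the size and shift thresholds, are all
assumptions chosen for illustration. Coordinates follow the SVG convention
(y grows downward).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Objects;

/** Hypothetical illustration only; not part of PDFBox or PDF2SVG. */
final class ScriptSketch {

    /** One extracted character: its text, left x, baseline y and font size. */
    static final class Glyph {
        final String text;
        final double x;
        final double y;
        final double fontSize;

        Glyph(String text, double x, double y, double fontSize) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.fontSize = fontSize;
        }
    }

    /** Median baseline of the glyphs, used as the line's main baseline. */
    static double medianBaseline(List<Glyph> glyphs) {
        List<Double> ys = new ArrayList<>();
        for (Glyph g : glyphs) {
            ys.add(g.y);
        }
        Collections.sort(ys);
        return ys.get(ys.size() / 2);
    }

    /**
     * Sorts glyphs left to right and wraps small, vertically shifted glyphs
     * in sup/sub tags. Thresholds are guesses; text is not XML-escaped.
     */
    static String toXhtml(List<Glyph> glyphs) {
        if (glyphs.isEmpty()) {
            return "";
        }
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.x));

        double baseline = medianBaseline(sorted);
        double maxSize = 0.0;
        for (Glyph g : sorted) {
            maxSize = Math.max(maxSize, g.fontSize);
        }

        StringBuilder out = new StringBuilder();
        String openTag = null;
        for (Glyph g : sorted) {
            double shift = baseline - g.y;          // positive: above the baseline
            boolean small = g.fontSize < 0.85 * maxSize;
            String tag = null;
            if (small && shift > 0.2 * maxSize) {
                tag = "sup";
            } else if (small && shift < -0.1 * maxSize) {
                tag = "sub";
            }
            if (!Objects.equals(tag, openTag)) {    // close and reopen on change
                if (openTag != null) {
                    out.append("</").append(openTag).append(">");
                }
                if (tag != null) {
                    out.append("<").append(tag).append(">");
                }
                openTag = tag;
            }
            out.append(g.text);
        }
        if (openTag != null) {
            out.append("</").append(openTag).append(">");
        }
        return out.toString();
    }
}
```

Real documents need much more than this (line detection, mixed font sizes,
rotated text), so treat the thresholds as starting points to tune.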


We have to do a significant amount of OCR because (a) diagrams contain
characters rendered as pixels and (b) scientific publishers use the
worst-ever non-compliant fonts in their PDFs. This means we have to guess
the character / code point from the glyph outline or the pixel map.
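
As a rough illustration of guessing a code point from a pixel map, the sketch
below does nearest-template matching by Hamming distance on small binary
bitmaps. This is a swapped-in textbook technique, not a description of what
our code does. The class PixelGuessSketch, its methods and the '#'/'.'
bitmap encoding are assumptions made for the example.

```java
import java.util.Map;

/** Hypothetical illustration only; not part of PDFBox or PDF2SVG. */
final class PixelGuessSketch {

    /** Counts differing pixels between two same-sized bitmaps ('#' = ink). */
    static int hammingDistance(String[] a, String[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("row counts differ");
        }
        int distance = 0;
        for (int r = 0; r < a.length; r++) {
            if (a[r].length() != b[r].length()) {
                throw new IllegalArgumentException("row widths differ");
            }
            for (int c = 0; c < a[r].length(); c++) {
                if (a[r].charAt(c) != b[r].charAt(c)) {
                    distance++;
                }
            }
        }
        return distance;
    }

    /**
     * Returns the code point of the template closest to the bitmap,
     * or -1 if no templates are supplied.
     */
    static int guessCodePoint(String[] bitmap, Map<Integer, String[]> templates) {
        int best = -1;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<Integer, String[]> e : templates.entrySet()) {
            int d = hammingDistance(bitmap, e.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

In practice the bitmaps would first have to be cropped and scaled to a common
size, and a distance cut-off would be needed to reject glyphs that match no
template well.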

Some of this is good beta, some is raw alpha. We'd be delighted if anyone
is interested in hacking pixels or glyph outlines in PDFs - it's painful
but you get a warm glow of having helped the human race. Same goes for
tables and document structuring...

BR

P




-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
