pdfbox-users mailing list archives

From Peter Murray-Rust <pm...@cam.ac.uk>
Subject Interpreting vector and pixel glyphs for characters
Date Tue, 24 Mar 2015 13:21:07 GMT
On Tue, Mar 24, 2015 at 9:26 AM, Maruan Sahyoun <sahyoun@fileaffairs.de>
wrote:

> ... As you would like to remove certain vectors which are matching a
> certain character/glyph you first need to find out which are the ones
> drawing e.g. the letter 'T'. I don't think that this is doable in a
> reasonable amount of time for arbitrary text.

>Maruan

This is true! And it's unfortunately a common problem with PDFs which use
* outline fonts/glyphs
* pixel glyphs
* scanned text

I think it is possible in limited subdomains and we are starting to try to
do this in science/maths. Our approach (
https://bitbucket.org/petermr/diagramanalyzer,
https://bitbucket.org/petermr/imageanalysis,
https://bitbucket.org/petermr/javaocr) is to create tools that recognize
text in common fonts. Unfortunately there is no obvious library choice for OCR
in Java (we looked at all of them; Tesseract is not native Java, so we have
ended up extending javaocr).

Scanned typescript can be a nightmare (missing pixels, bleeding across
glyph boundaries, etc.) but sometimes works.
In our approach we try to analyze born-digital glyphs by heuristics rather
than machine-learning (which needs retraining for every new font/size). The
vector glyphs have a constant SVG signature for each character and this can
sometimes be worked out (or mapped by the crowd). The pixel glyphs are
harder and we shrink them to a common skeleton and classify from that. Once
one character is done it's usually possible to recognize it in later
occurrences.
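
To illustrate the vector-glyph idea above, here is a minimal Java sketch. It is
not the code in the repositories listed; the class GlyphSignatures and its
methods are hypothetical names, and it assumes the glyph outline is already
available as an SVG path "d" string. The signature here is simply the sequence
of path commands, which stays the same when a glyph is moved or scaled. That is
a simplifying assumption: different characters can share a command sequence, so
a real tool would need extra features to tell them apart.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: look up vector glyphs by the command sequence of their outline. */
final class GlyphSignatures {
    // Signature -> character, filled in as glyphs are identified (by hand or by the crowd).
    private final Map<String, Character> known = new HashMap<>();

    /** Reduces SVG path data to its command letters, e.g. "M0 0L5 0Z" gives "MLZ". */
    static String signature(String svgPathData) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < svgPathData.length(); i++) {
            char c = svgPathData.charAt(i);
            // Letters are path commands, except e/E which only occur in numbers like 1e-3.
            if (Character.isLetter(c) && c != 'e' && c != 'E') {
                // Upper-casing treats relative and absolute commands alike.
                sb.append(Character.toUpperCase(c));
            }
        }
        return sb.toString();
    }

    /** Records that outlines with this signature draw the given character. */
    void learn(String svgPathData, char character) {
        known.put(signature(svgPathData), character);
    }

    /** Returns the character for a previously learned signature, if any. */
    Optional<Character> recognize(String svgPathData) {
        return Optional.ofNullable(known.get(signature(svgPathData)));
    }
}
```

Once learn() has been called for one occurrence of a glyph, recognize() matches
later occurrences of the same outline at any position or size.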

It's early days, but if people are interested in collaborating or have
better solutions we'd be interested (we aren't able to help with casual
problems).

P.




-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
