pdfbox-users mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject tracking missing Unicode mappings?
Date Thu, 21 Sep 2017 20:07:09 GMT

How much effort would it take to track, for a given page, the ratio of characters with
missing Unicode mappings to those with mappings?  After extracting text from a page, it
would be useful to know how many characters were lost.  In Tika, we could use this
information to decide whether or not to run OCR on that page.

I see that there’s currently a Set<String> that tracks which characters have a missing
Unicode mapping, in order to limit duplicate logging.  If we changed that to a
Map<String,Integer>, we could also track the number of occurrences.
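
To make the idea concrete, here is a rough sketch of what such a counter might look like.
It is not PDFBox code: the class name, the method names, and the key format (font name plus
character code) are all hypothetical, and where it would be called from is left open.  It
reports the missing characters as a fraction of all characters seen, rather than as a
missing-to-mapped ratio, so that a page with no mapped characters at all does not cause a
division by zero.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of PDFBox) that counts characters with and without a Unicode mapping. */
final class MissingUnicodeCounter {
    // Occurrences per unmapped character; the key format is an assumption, e.g. "fontName-code".
    private final Map<String, Integer> missing = new HashMap<>();
    private int mapped;

    /** Records one unmapped character; returns true only on the first occurrence, so logging can stay de-duplicated. */
    boolean recordMissing(String key) {
        return missing.merge(key, 1, Integer::sum) == 1;
    }

    /** Records one character that did have a Unicode mapping. */
    void recordMapped() {
        mapped++;
    }

    /** Total number of unmapped characters recorded since the last reset. */
    int totalMissing() {
        int total = 0;
        for (int count : missing.values()) {
            total += count;
        }
        return total;
    }

    /** Fraction of all recorded characters that had no mapping; 0.0 if nothing was recorded. */
    double missingFraction() {
        int lost = totalMissing();
        int all = lost + mapped;
        return all == 0 ? 0.0 : (double) lost / all;
    }

    /** Clears all counts, e.g. before starting the next page. */
    void reset() {
        missing.clear();
        mapped = 0;
    }
}
```

A caller such as Tika could then compare missingFraction() against a threshold of its own
choosing after each page, and call reset() before the next one.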

Is there an easy enough way to get the fonts after processing a page and then get this
info from them?  Are we doing any static caching of fonts that would prevent accurate counts?

Thank you.

