pdfbox-users mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: tracking missing Unicode mappings?
Date Thu, 21 Sep 2017 21:16:59 GMT
Perfect.  Thank you.  I'll open an issue and draft a patch.

-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Thursday, September 21, 2017 4:21 PM
To: users@pdfbox.apache.org
Subject: Re: tracking missing Unicode mappings?

The standard 14 fonts are cached, but these shouldn't cause any trouble for text extraction.

So all that would be needed is a map, as you described, in the PDFont type.

Now how to access the fonts... if you grab the TextPosition objects in an extension of PDFTextStripper (e.g. in the showGlyph method), you could get the PDFont objects there.

But this is all just a thought. I did not implement anything.

Tilman

Am 21.09.2017 um 22:07 schrieb Allison, Timothy B.:
> All,
>
> How much effort would it be to track/calculate a ratio of characters with missing Unicode mappings to those with mappings for a given page? It would be neat after trying to extract text from a page to be able to tell how many characters are lost. We could use this info on Tika to determine whether or not to run OCR on a given page.
>
> I see that there’s currently a Set<String> for tracking which characters have a missing Unicode mapping to limit duplicate logging. If we could change that to a Map<String, Integer> we could track the occurrences.
>
> Is there an easy enough way to get the fonts after processing a page and then getting this info? Are we doing any static caching of fonts that would prevent accurate counts?
>
> Thank you.
>
>           Best,
>
>                    Tim
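
The Set<String> to Map<String, Integer> idea from the thread can be sketched in plain Java. This is only an illustration under stated assumptions, not PDFBox code: the class MissingUnicodeCounter, its method names, and the "fontName:code" key convention are all hypothetical, and the ratio here is computed as missing glyphs over all glyphs seen rather than missing over mapped.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical helper (not part of PDFBox): counts glyphs with and
 * without a Unicode mapping so a per-page ratio can be computed.
 */
class MissingUnicodeCounter {
    // Key identifies an unmapped glyph, e.g. "fontName:code" (assumed convention).
    private final Map<String, Integer> missing = new HashMap<>();
    private int mapped = 0;
    private int unmapped = 0;

    /** Call once for every glyph that had a Unicode mapping. */
    void recordMapped() {
        mapped++;
    }

    /** Call once for every glyph without a mapping; repeated keys are tallied. */
    void recordMissing(String key) {
        missing.merge(key, 1, Integer::sum);
        unmapped++;
    }

    /** Number of times the given unmapped glyph was seen (0 if never). */
    int occurrences(String key) {
        return missing.getOrDefault(key, 0);
    }

    /** Fraction of all recorded glyphs that had no mapping; 0.0 if nothing was recorded. */
    double missingRatio() {
        int total = mapped + unmapped;
        return total == 0 ? 0.0 : (double) unmapped / total;
    }
}
```

A caller would create one counter per page, feed it from wherever the missing mapping is detected, and compare missingRatio() against a threshold of its own choosing to decide whether OCR is worth running. Logging each key only the first time it appears would keep the duplicate-suppression behavior of the existing Set.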


