pdfbox-users mailing list archives

From Andreas Lehmkuehler <andr...@lehmi.de>
Subject Re: [Identity-H][Text Extraction] Overwrite toUnicode Mapping
Date Sun, 30 Oct 2016 15:22:08 GMT

On 30.10.2016 at 07:46, Maryam Z wrote:
> Hi,
> I am trying to extract Sinhala and Tamil text from PDFs, and am facing a
> problem extracting text correctly when the PDF uses Unicode Fonts "Iskoola
> Pota" (Sinhala) or "Latha" (Tamil).
> While the extraction works as expected when the encoding is WinAnsi, if the
> encoding is "Identity-H" some letters come out jumbled (valid Sinhala or
> Tamil characters, but the wrong ones), and the jumbled letters differ from
> PDF to PDF. This is because the toUnicode tables in such PDFs are incorrect
> and map glyphs to the wrong Unicode values.
> I came across the solution to the Identity-H problem for CJK fonts using
> CMap files, but CMap files for these two fonts are not available.
> I would be grateful if you could let me know if there is any way to
> overwrite the toUnicode map and use a custom map in extraction, which
> correctly maps glyphs to values, or if there is any other effective
> solution for this problem.
Did you perform the "Acrobat test" (see [1])?

What version of PDFBox are you using?

Can you share a sample PDF with us (provide a link to a public download)?

> Thank you!


[1] http://pdfbox.apache.org/2.0/faq.html#notext
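
On the question of substituting a custom map for a faulty toUnicode table, the
idea can be sketched in plain Java as a lookup keyed by character code that
falls back to whatever the PDF's own table produced. This is only an
illustration, not PDFBox API: the class ToUnicodeOverride, its methods, and the
codes and strings used with it are all hypothetical. With Identity-H the codes
are font-specific and may differ from PDF to PDF, so a table like this would
probably have to be built per font. Values are strings rather than single chars
because one Sinhala or Tamil glyph can stand for several code points.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of PDFBox): corrects text decoded through a faulty toUnicode table. */
class ToUnicodeOverride {
    // Keyed by character code as it appears in the content stream.
    // The keys are assumptions; real values depend on the embedded font.
    private final Map<Integer, String> corrections = new HashMap<>();

    /** Registers the correct Unicode string for one character code. */
    ToUnicodeOverride put(int code, String unicode) {
        corrections.put(code, unicode);
        return this;
    }

    /** Returns the override for a code, or the PDF's own mapping when none is registered. */
    String resolve(int code, String fromPdf) {
        return corrections.getOrDefault(code, fromPdf);
    }

    /** Applies resolve() to parallel arrays: codes and the strings the PDF's table produced. */
    String decode(int[] codes, String[] fromPdf) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < codes.length; i++) {
            sb.append(resolve(codes[i], fromPdf[i]));
        }
        return sb.toString();
    }
}
```

To use something like this during extraction, the per-glyph character code and
the decoded string would have to be taken from the extraction library; how to
obtain them depends on the PDFBox version and should be checked against its
documentation.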

