pdfbox-users mailing list archives

From Maryam Z <maryamuom...@gmail.com>
Subject [Identity-H][Text Extraction] Overwrite toUnicode Mapping
Date Sun, 30 Oct 2016 06:46:21 GMT

I am trying to extract Sinhala and Tamil text from PDFs, and the text is
not extracted correctly when the PDF uses the Unicode fonts "Iskoola Pota"
(Sinhala) or "Latha" (Tamil).

Extraction works as expected when the encoding is WinAnsi. When the
encoding is "Identity-H", however, some letters come out jumbled (valid
Sinhala or Tamil characters, but the wrong ones), and which letters are
jumbled differs from PDF to PDF. This is because the toUnicode tables in
such PDFs are incorrect and map glyphs to the wrong Unicode values.

I came across the solution to the Identity-H problem for CJK fonts, which
uses CMap files, but no CMap files are available for these two fonts.

I would be grateful if you could let me know if there is any way to
overwrite the toUnicode map and use a custom map in extraction, which
correctly maps glyphs to values, or if there is any other effective
solution for this problem.
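
To make the question concrete, here is a minimal sketch in plain Java of
the kind of override I have in mind. It does not use any PDFBox API: the
class name GlyphRemapSketch, its methods, and the character codes in the
example are all hypothetical and only illustrate the lookup. Where such a
table could be hooked into the extraction is exactly what I am unsure of.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-font correction table (not a PDFBox class).
// It maps a character code from the content stream to the text that
// the glyph should really produce.
public class GlyphRemapSketch {
    private final Map<Integer, String> overrides = new HashMap<>();

    // Register a corrected mapping for one character code.
    public void put(int code, String unicode) {
        overrides.put(code, unicode);
    }

    // Prefer the custom mapping; otherwise keep whatever the PDF's own
    // toUnicode table returned for this code.
    public String resolve(int code, String fromToUnicode) {
        String fixed = overrides.get(code);
        return fixed != null ? fixed : fromToUnicode;
    }

    public static void main(String[] args) {
        GlyphRemapSketch latha = new GlyphRemapSketch();
        // Made-up example: assume code 0x0041 is really TAMIL LETTER KA
        // (U+0B95) but the PDF maps it to TAMIL LETTER TA (U+0BA4).
        latha.put(0x0041, "\u0B95");

        // Overridden code: the custom value is used.
        System.out.println(latha.resolve(0x0041, "\u0BA4"));
        // Code without an override: the PDF's value is kept.
        System.out.println(latha.resolve(0x0042, "\u0BAE"));
    }
}
```

Since the jumbled letters differ from PDF to PDF, I assume a table like
this would have to be built per document (or per embedded font subset)
rather than once per font.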

Thank you!
