pdfbox-users mailing list archives

From Jukka Zitting <jukka.zitt...@gmail.com>
Subject Re: Extracting symbols from Text
Date Wed, 25 Aug 2010 11:51:46 GMT

On Tue, Aug 24, 2010 at 9:50 PM, Yogesh <yogeshp08@gmail.com> wrote:
> I have PDFs for scientific literature. I want to extract all the notations
> like alpha, beta, gamma, delta and some other symbols along with the text.
> The PDFTextStripper works fine and gives me the text.
> How can I get these symbols along with the text the way it occurs in the
> PDF?

Those symbols probably come from a special font that has no mapping
to Unicode. Without such a mapping, PDFBox can't tell which character
to output for each symbol.

You can inspect the PDF document to find out which font is used, and
then look for an existing CMap file for that font. In the worst case
you may need to construct such a character map yourself to teach
PDFBox how to interpret the symbols used in your document.
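
To make the idea of a character map concrete, here is a minimal sketch
in plain Java. It is not PDFBox's CMap mechanism: SymbolCharMap and
decode are hypothetical names, and the four entries assume the font
follows the Adobe Symbol layout, where the codes for 'a', 'b', 'g' and
'd' are drawn as alpha, beta, gamma and delta. The font in your
documents may use different codes, so treat the table as a placeholder.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of PDFBox.
final class SymbolCharMap {

    // Font character code -> Unicode character.
    // Assumption: the font uses the Adobe Symbol layout for these codes.
    private static final Map<Character, Character> TO_UNICODE = new HashMap<>();

    static {
        TO_UNICODE.put('a', '\u03B1'); // GREEK SMALL LETTER ALPHA
        TO_UNICODE.put('b', '\u03B2'); // GREEK SMALL LETTER BETA
        TO_UNICODE.put('g', '\u03B3'); // GREEK SMALL LETTER GAMMA
        TO_UNICODE.put('d', '\u03B4'); // GREEK SMALL LETTER DELTA
    }

    // Translate raw codes from a run of symbol-font text into Unicode.
    // Codes without an entry in the table are passed through unchanged.
    static String decode(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char code = raw.charAt(i);
            char mapped = TO_UNICODE.getOrDefault(code, code);
            out.append(mapped);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Decode a run that was set in the symbol font.
        System.out.println(decode("abgd"));
    }
}
```

A table like this only makes sense for text that was actually set in
the symbol font. Applied to ordinary text it would turn every plain
"a" into an alpha, so the extraction step has to keep track of which
font each run of text came from.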


Jukka Zitting
