pdfbox-users mailing list archives

From Andreas Lehmkuehler <andr...@lehmi.de>
Subject Re: Fwd: Junk Characters while Extracting text from pdf file.
Date Thu, 07 Feb 2013 06:35:35 GMT

Am 05.02.2013 22:06, schrieb Peter Murray-Rust:
> On Tue, Feb 5, 2013 at 6:36 PM, Andreas Lehmkuehler <andreas@lehmi.de> wrote:
>> Hi,
>> Am 05.02.2013 15:01, schrieb kulbhushan singh:
>>> Hi,
>>> I am trying to extract text from a pdf file with custom fonts but it is
>>> giving me junk characters. The fonts used are ArialMT (embedded subset) &
>>> Arial-BoldMT (embedded subset). The producer of the pdf file is GPL
>>> Ghostscript 8.15. I am using PDFTextStripper to extract the text. How can I
>>> do it for custom fonts? Any reference or solution would be appreciated.
>> Did you do the "adobe" test? [1]
> Does this require buying Adobe Acrobat? Or is there a free version?
No, just open the pdf in question using Adobe Reader, select the (whole)
text and try to copy and paste it into an editor. "File -> Save as text"
should do the same. If neither works, the text can't be extracted. In
most cases a mapping to readable text is missing; it is not required to
render/print the pdf, only to extract the text.
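
The manual copy-and-paste check can be roughly approximated in code once some
text has been extracted. The following is only a minimal sketch: the class
JunkTextCheck and its methods are hypothetical helpers, not part of PDFBox,
and the threshold is an arbitrary assumption. It counts only control,
private-use, unassigned and replacement characters, so junk that consists of
ordinary letters mapped to the wrong glyphs will not be detected.

```java
import java.util.PrimitiveIterator;

/** Hypothetical helper (not part of PDFBox): crude check for unreadable extracted text. */
final class JunkTextCheck {

    private JunkTextCheck() {}

    /** Returns the fraction of code points that rarely occur in readable text. */
    static double junkRatio(String text) {
        if (text == null || text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int suspicious = 0;
        PrimitiveIterator.OfInt it = text.codePoints().iterator();
        while (it.hasNext()) {
            int cp = it.nextInt();
            total++;
            if (isSuspicious(cp)) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    /** Flags control, private-use, unassigned and U+FFFD code points; common whitespace is allowed. */
    static boolean isSuspicious(int cp) {
        if (cp == '\n' || cp == '\r' || cp == '\t') {
            return false;
        }
        int type = Character.getType(cp);
        return cp == 0xFFFD
                || type == Character.CONTROL
                || type == Character.PRIVATE_USE
                || type == Character.UNASSIGNED;
    }

    /** The threshold is an assumed tuning value, e.g. 0.3 for "more than 30% suspicious". */
    static boolean looksLikeJunk(String text, double threshold) {
        return junkRatio(text) > threshold;
    }
}
```

The string passed in would be whatever the text extraction returned; if
looksLikeJunk reports true for most pages, the missing-mapping case described
above is a likely cause.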

> I have created heuristics for about 100 of these non-conformant fonts (
> http://bitbucket.org/petermr/pdf2svg which uses PDFBox). If you mail me a
> sample file I can see whether these would help. I have done several TeX
> fonts (CMM etc.) but haven't done a Ghostscript one and it would be useful.
> But as Andreas says, ultimately these are probably non-conformant. A mixture
No, a missing mapping doesn't lead to a non-conformant pdf. It is still

> of heuristics and glyph analysis (OCR and/or heuristics) are required.
> Again PDF2SVG is addressing these - any community involvement is valued.
Yes, that's the only workaround I know. Create an image for each page and
use some OCR software to get the text out of it.

Andreas Lehmkühler
