pdfbox-users mailing list archives

From Peter Murray-Rust <pm...@cam.ac.uk>
Subject Re: Fwd: Junk Characters while Extracting text from pdf file.
Date Tue, 05 Feb 2013 21:06:28 GMT
On Tue, Feb 5, 2013 at 6:36 PM, Andreas Lehmkuehler <andreas@lehmi.de> wrote:

> Hi,
>
> On 05.02.2013 15:01, kulbhushan singh wrote:
>
>> Hi,
>>
>> I am trying to extract text from a pdf file with custom fonts but it is
>> giving me junk characters. The fonts used are ArialMT (embedded subset) &
>> Arial-BoldMT (embedded subset). The producer of the pdf file is GPL
>> Ghostscript 8.15. I am using PDFTextStripper to extract the text. How can I
>> do it for custom fonts? Any reference or solution would be appreciated.
>>
> Did you do the "adobe" test? [1]
>

Does this require buying Adobe Acrobat? Or is there a free version?

I have created heuristics for about 100 of these non-conformant fonts (
http://bitbucket.org/petermr/pdf2svg which uses PDFBox). If you mail me a
sample file I can see whether these would help. I have done several TeX
fonts (CMM etc.) but haven't done a Ghostscript one, and it would be useful.

But as Andreas says, ultimately these fonts are probably non-conformant. A
mixture of heuristics and glyph analysis (OCR and/or heuristics) is required.
Again, PDF2SVG is addressing these - any community involvement is valued.
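
To give a rough idea of what the heuristic side can look like, here is a
minimal sketch of a per-font correction table applied to already-extracted
text. It is only an illustration and not the actual PDF2SVG code: the class
name GlyphRemapSketch, the font name "DemoFont-Subset" and the table entries
are all invented, and it deliberately uses no PDFBox calls.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch: correct text extracted from a non-conformant font
 * by looking up each extracted character in a per-font table.
 */
final class GlyphRemapSketch {

    // Font name -> (extracted character -> intended character).
    // Every entry here is invented for illustration only.
    private static final Map<String, Map<Character, Character>> TABLES = new HashMap<>();

    static {
        Map<Character, Character> demo = new HashMap<>();
        demo.put('\u0001', 'H');
        demo.put('\u0002', 'i');
        TABLES.put("DemoFont-Subset", demo);
    }

    /** Returns the text with known bad codes replaced; unknown fonts and codes pass through. */
    static String remap(String fontName, String extracted) {
        Map<Character, Character> table = TABLES.get(fontName);
        if (table == null) {
            return extracted; // no heuristic recorded for this font
        }
        StringBuilder out = new StringBuilder(extracted.length());
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            Character mapped = table.get(c);
            out.append(mapped != null ? mapped.charValue() : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Two junk control codes followed by a character that is already correct.
        System.out.println(remap("DemoFont-Subset", "\u0001\u0002!")); // prints "Hi!"
    }
}
```

A table like this only helps when the same subset font encoding recurs across
files; when every file carries a different subset, the codes have to be
recovered from the glyph shapes instead.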



-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
