pdfbox-dev mailing list archives

From Andreas Lehmkühler (JIRA) <j...@apache.org>
Subject [jira] [Closed] (PDFBOX-1823) Apache PDFBox 1.6.0 TextStripper not able to recognise characters having "Frutiger LT - 45" fonts
Date Thu, 02 Jan 2014 16:10:52 GMT

     [ https://issues.apache.org/jira/browse/PDFBOX-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler closed PDFBOX-1823.

    Resolution: Not A Problem
      Assignee: Andreas Lehmkühler

The text can't be extracted. The PDF doesn't contain any information for mapping the internal
glyph IDs to readable text.
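
To illustrate the point, here is a minimal sketch of what a text extractor has to do with glyph IDs. It is not PDFBox code: the class name GlyphDecodeSketch, the glyph IDs and the mapping table are all hypothetical, and a plain map stands in for whatever glyph-to-Unicode information a PDF might carry. When that information is absent, the extractor can only emit placeholders, which is what shows up as boxes in the output.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch only, not the PDFBox implementation.
 * Decodes a sequence of glyph ids with a glyph-to-Unicode table
 * (a hypothetical stand-in for the mapping data inside a PDF).
 */
final class GlyphDecodeSketch {

    /** Placeholder (U+FFFD) emitted when a glyph id has no Unicode mapping. */
    static final String UNMAPPED = "\uFFFD";

    /** Returns the readable text for the glyph ids, or placeholders where no mapping exists. */
    static String decode(int[] glyphIds, Map<Integer, String> glyphToUnicode) {
        StringBuilder text = new StringBuilder();
        for (int glyphId : glyphIds) {
            String unicode = glyphToUnicode.get(glyphId);
            text.append(unicode != null ? unicode : UNMAPPED);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Made-up glyph ids as they might appear in a page's content stream.
        int[] glyphIds = {5, 9, 9, 3};

        // Case 1: the document supplies a mapping, so the text is recoverable.
        Map<Integer, String> withMapping = new HashMap<>();
        withMapping.put(5, "b");
        withMapping.put(9, "o");
        withMapping.put(3, "k");
        System.out.println(decode(glyphIds, withMapping));

        // Case 2: no mapping at all, as in the attached PDF; only placeholders come out.
        System.out.println(decode(glyphIds, new HashMap<Integer, String>()));
    }
}
```

The glyph shapes still render correctly on screen in the second case, because drawing only needs the outlines stored in the embedded font; it is only the step back to characters that has nothing to work with.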

The only workaround I know of is to convert every single page of the PDF to an image and pass
the result to OCR software. But I guess that isn't very handy ...

Anyway, I've closed this issue, as there isn't any problem with PDFBox. If you have any further
questions, please address them to one of our mailing lists. See [1] on how to subscribe.

[1] http://pdfbox.apache.org/mailinglists.html

> Apache PDFBox 1.6.0 TextStripper not able to recognise characters having "Frutiger LT - 45" fonts
> -------------------------------------------------------------------------------------------------
>                 Key: PDFBOX-1823
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1823
>             Project: PDFBox
>          Issue Type: Bug
>          Components: FontBox
>    Affects Versions: 1.6.0
>         Environment: jdk1.6
>            Reporter: Chitrang Natu
>            Assignee: Andreas Lehmkühler
>              Labels: newbie
>         Attachments: PDF_With_Frutiger_font.pdf, TC01_output.concat.MD302AE_Part2.doc, Test_Frutiger.java, fontbox-checkstyle.xml, pdfbox-checkstyle.xml, pom.xml
>   Original Estimate: 504h
>  Remaining Estimate: 504h
> When I try to extract content from PDFs, I can successfully extract all text with the
> PDFBox API, but I have trouble with fonts in the 'Frutiger' style. For these I get
> square boxes in place of characters.
> It seems PDFBox FontBox supports only 14 UTF character sets, and none of them is a
> Frutiger-style font.
> Can anybody please suggest something? That would be of great help. I am in urgent
> need of a solution.

This message was sent by Atlassian JIRA
