pdfbox-users mailing list archives

From Andreas Lehmkuehler <andr...@lehmi.de>
Subject Re: Word prefixes fi, fl go missing in text produced by pdfbox-app v 1.8.3 to 1.8.5
Date Sun, 04 May 2014 16:51:26 GMT
Hi,

On 02.05.2014 07:49, Anupama Krishnan wrote:
> Hello,
>
> I ran pdfbox-app version 1.8.5 over the PDF Greenstone manual:
> http://www.greenstone.org/docs/greenstone3/manual.pdf
>
> It removed the fl and fi prefixes from words like "flexible", "file" and
> "first". Perhaps these genuine word prefixes have been confused with f-based
> ligatures?
>
> We were previously using a pdfbox-app 1.5.* version and wanted to switch over to
> a newer one. Version 1.8.2 does not have this issue.
>
>
> The command we ran:
> java -cp "/path/to/pdfbox-app-1.8.5.jar" -Dline.separator="<br />"
> org.apache.pdfbox.ExtractText -html "/path/to/manual.pdf"
>
> Relevant excerpts from the output generated:
> - "improve exibility, modularity, and extensibility"
> the 2nd word should be "flexibility"
> - "Table 1 shows the le hierarchy for Greenstone3. The rst part shows the common"
> The words "file" and "first" have been truncated to "le" and "rst"
>
> I believe this is a bug rather than intended behaviour.
Yes, I can reproduce that behaviour and created an issue [1] on JIRA.
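
The missing letters match what you would see if the single-character ligatures U+FB01 (fi) and U+FB02 (fl) were dropped instead of expanded. That is only an assumption about the cause; it is not confirmed in this thread. The following is a minimal sketch, using only the JDK and no PDFBox code, of how such ligature characters expand to plain letters under Unicode compatibility normalization. The class and method names (LigatureDemo, expandLigatures) are hypothetical.

```java
import java.text.Normalizer;

public class LigatureDemo {

    // Hypothetical helper: NFKC normalization replaces compatibility
    // characters such as U+FB01 and U+FB02 with their plain-letter
    // equivalents ("fi" and "fl").
    static String expandLigatures(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Sample text written with ligature code points, modelled on the
        // words from the report above.
        String raw = "\uFB02exibility, \uFB01le, \uFB01rst";
        System.out.println(expandLigatures(raw));
    }
}
```

If extracted text still contains the ligature code points, a post-processing step like this restores the plain spelling. It cannot help when the characters are already missing from the output, as in the excerpts quoted above.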


> Kind regards,
> Anupama
Thanks for the report.

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-2058

