pdfbox-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Created: (PDFBOX-347) Spaces removed after text extraction
Date Mon, 04 Aug 2008 17:48:44 GMT
Spaces removed after text extraction

                 Key: PDFBOX-347
                 URL: https://issues.apache.org/jira/browse/PDFBOX-347
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
            Reporter: Jukka Zitting
            Priority: Minor

[Issue from SourceForge]

The spaces between words in the attached PDF file are removed upon text
extraction.

I traced the code and found that the cause seems to be a "division by 0"
bug in PDCIDFont.java.

In PDCIDFont.getAverageFontWidth(), widths is returned as null from

COSArray widths = (COSArray)font.getDictionaryObject( COSName.getPDFName( "W" ) );

which causes characterCount to be 0.

The result is that the following line
float average = totalWidths / characterCount;

evaluates to NaN, which propagates up the calling methods and results in
the spaces being removed.
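
To illustrate the failure mode described above, here is a minimal, self-contained sketch. It is not PDFBox code: the class, the method names, the use of float for characterCount, and the gap-threshold check are all assumptions made for illustration. It only shows that dividing a zero float by zero yields NaN in Java, and that any comparison against NaN is false, which is one plausible way a space check could silently stop firing.

```java
class NaNPropagationDemo {
    // Hypothetical stand-in for the unguarded computation in the report.
    static float unguardedAverage(float totalWidths, float characterCount) {
        return totalWidths / characterCount; // 0f / 0f evaluates to NaN, no exception
    }

    // Hypothetical space-detection check; NOT PDFBox's actual logic.
    static boolean looksLikeWordGap(float gap, float averageWidth) {
        // Any comparison involving NaN is false.
        return gap > averageWidth * 0.5f;
    }

    public static void main(String[] args) {
        float average = unguardedAverage(0f, 0f);
        System.out.println(Float.isNaN(average));           // prints "true"
        System.out.println(looksLikeWordGap(10f, average)); // prints "false"
    }
}
```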

I suggest the following fix. Instead of:
float average = totalWidths / characterCount;

use:
float average = defaultWidth;

if (characterCount > 0) {
average = totalWidths / characterCount;
}
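
The suggested guard can be tried in isolation with the sketch below. The class name, the method signature, and the sample values (including 1000 as the default width) are hypothetical and chosen only for illustration; the real method in PDCIDFont reads these values from the font dictionary.

```java
class AverageWidthSketch {
    // Hypothetical helper mirroring the suggested fix: fall back to the
    // default width when no character widths were counted.
    static float averageWidth(float totalWidths, float characterCount, float defaultWidth) {
        float average = defaultWidth;
        if (characterCount > 0) {
            average = totalWidths / characterCount;
        }
        return average;
    }

    public static void main(String[] args) {
        // No widths available: the default is returned instead of NaN.
        System.out.println(averageWidth(0f, 0f, 1000f));    // prints "1000.0"
        // Widths available: the ordinary average is returned.
        System.out.println(averageWidth(1500f, 3f, 1000f)); // prints "500.0"
    }
}
```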

[Comment on SourceForge]
Date: 2008-03-12 03:01
Sender: choongyong
Logged In: YES 
Originator: NO

I realised that I was not logged in when I raised the request.
I am sending this comment so that the developers can contact me.

[Comment on SourceForge]
Date: 2008-03-17 21:50
Sender: nobody
Logged In: NO 

I have noticed that there is no space between two words if they are
separated by a new line (or if the second word is on the next line because
the text reaches the right margin).

Could you please correct this?

