tika-dev mailing list archives

From "Clark Perkins (Jira)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-3131) PDFParserConfig default values were accidentally swapped
Date Fri, 10 Jul 2020 23:22:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-3131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17155789#comment-17155789 ]

Clark Perkins commented on TIKA-3131:
-------------------------------------

I'm pretty sure this was just an oversight when copying defaults from PDFBox, so I went ahead
and opened a PR to fix them.

> PDFParserConfig default values were accidentally swapped
> --------------------------------------------------------
>
>                 Key: TIKA-3131
>                 URL: https://issues.apache.org/jira/browse/TIKA-3131
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.24.1
>            Reporter: Clark Perkins
>            Priority: Major
>
> When default values were added for averageCharTolerance and spacingTolerance as a part
> of TIKA-3091, their values appear to have been inadvertently swapped.
> From PDFBox:
> {noformat}
>     private float spacingTolerance = .5f;
>     private float averageCharTolerance = .3f;
> {noformat}
> From tika 1.24.1:
> {noformat}
>     //The character width-based tolerance value used to estimate where spaces in text
>     //should be added
>     //Default taken from PDFBox.
>     private Float averageCharTolerance = 0.5f;
>     //The space width-based tolerance value used to estimate where spaces in text should
>     //be added
>     //Default taken from PDFBox.
>     private Float spacingTolerance = 0.3f;
> {noformat}
> The resulting change in effective defaults causes PDFParser to insert more spaces than
> it did in 1.24 and earlier.
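
To illustrate why swapping the two values can produce extra spaces, here is a simplified
sketch of how the tolerances might interact: each one scales a width estimate, and the
smaller product is used as the gap threshold above which a space is inserted. This is an
assumed model for illustration, not PDFBox's actual code. The class ToleranceSketch, its
methods, and the sample widths (in em units) are all hypothetical.

```java
// Simplified, assumed model of space detection; NOT PDFBox's actual implementation.
final class ToleranceSketch {

    // Gap (in em) above which a space is inserted:
    // the smaller of the two scaled width estimates.
    static float gapThreshold(float spaceWidth, float avgCharWidth,
                              float spacingTolerance, float averageCharTolerance) {
        float deltaSpace = spaceWidth * spacingTolerance;
        float deltaChar = avgCharWidth * averageCharTolerance;
        return Math.min(deltaSpace, deltaChar);
    }

    // True when the gap between two glyphs exceeds the threshold.
    static boolean insertsSpace(float gap, float spaceWidth, float avgCharWidth,
                                float spacingTolerance, float averageCharTolerance) {
        return gap > gapThreshold(spaceWidth, avgCharWidth,
                                  spacingTolerance, averageCharTolerance);
    }

    public static void main(String[] args) {
        float spaceWidth = 0.25f;   // hypothetical: space glyph is 0.25 em wide
        float avgCharWidth = 0.5f;  // hypothetical: average glyph width of 0.5 em
        float gap = 0.1f;           // hypothetical gap between two adjacent glyphs

        // PDFBox defaults: spacingTolerance = .5f, averageCharTolerance = .3f
        System.out.println(insertsSpace(gap, spaceWidth, avgCharWidth, 0.5f, 0.3f));
        // Tika 1.24.1 defaults: spacingTolerance = 0.3f, averageCharTolerance = 0.5f
        System.out.println(insertsSpace(gap, spaceWidth, avgCharWidth, 0.3f, 0.5f));
    }
}
```

Under this model, whenever the space width is smaller than the average character width,
the swapped values lower the threshold, so a gap that was previously ignored is treated
as a word break.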



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
