pdfbox-users mailing list archives

From Augusto Ribeiro Silva <...@unsilo.com>
Subject Re: Weird spacing in words
Date Wed, 01 Jun 2016 11:59:55 GMT
Hi,

Tweaking the parameters didn’t help.
Here is part of the PDF in question: https://dl.dropboxusercontent.com/u/2456015/problem.pdf

Best regards,
Augusto

> On 31 May 2016, at 22:44, Tilman Hausherr <THausherr@t-online.de> wrote:
> 
> Looks like a different problem. Assuming you're using the latest version, you might want to try setting
> 
> PDFTextStripper.setSpacingTolerance()
> 
> the default is 0.5f
> 
> So try some values slightly above or below, e.g. 0.4f, 0.6f, etc.
> 
> another one is
> 
> setAverageCharTolerance()
> 
> the default is 0.3f.
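
A minimal sketch of what trying those values could look like, assuming PDFBox 2.0.x (where PDFTextStripper lives in org.apache.pdfbox.text and documents are opened with PDDocument.load). The class name ToleranceSweep, the file name problem.pdf and the particular values swept are illustrative assumptions, not something taken from the thread.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Hypothetical helper: extracts page 1 with several tolerance combinations
// so the outputs can be compared by eye.
public class ToleranceSweep {
    public static void main(String[] args) throws IOException {
        // Assumption: a local copy of the PDF; the path is a placeholder.
        File pdf = new File(args.length > 0 ? args[0] : "problem.pdf");

        float[] spacingTolerances = {0.4f, 0.5f, 0.6f};   // 0.5f is the default
        float[] averageCharTolerances = {0.2f, 0.3f, 0.4f}; // 0.3f is the default

        try (PDDocument document = PDDocument.load(pdf)) {
            for (float spacing : spacingTolerances) {
                for (float avgChar : averageCharTolerances) {
                    PDFTextStripper stripper = new PDFTextStripper();
                    stripper.setSpacingTolerance(spacing);
                    stripper.setAverageCharTolerance(avgChar);
                    stripper.setStartPage(1);
                    stripper.setEndPage(1);

                    String text = stripper.getText(document);
                    // Print only the beginning of the page to keep the output short.
                    String head = text.substring(0, Math.min(200, text.length()));
                    System.out.printf("spacing=%.1f avgChar=%.1f%n%s%n%n",
                            spacing, avgChar, head);
                }
            }
        }
    }
}
```

If every combination prints the same split words, the extra spaces are probably not coming from these two heuristics.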
> 
> Tilman
> 
> Am 31.05.2016 um 22:36 schrieb Augusto Ribeiro Silva:
>> Hi,
>> 
>> PDFDebugger shows the following.
>>  (The ) Tj
>>   22.7679 0 Td
>>   (es t) Tj
>>   12.2023 0 Td
>>   (ab lis) Tj
>>   20.7981 0 Td
>>   (h m) Tj
>>   14.0054 0 Td
>>   (ent ) Tj
>>   19.1013 0 Td
>>   (of ) Tj
>>   14.83369 0 Td
>>   (an ) Tj
>>   16.0359 0 Td
>>   (in te gr) Tj
>>   25.72701 0 Td
>>   (ate) Tj
>>   12.80299 0 Td
>>   (d ) Tj
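
One thing that can be checked from this excerpt alone is whether the spaces are already inside the string operands. The sketch below is plain Java with no PDFBox dependency. It concatenates the operands of the Tj operators and ignores the Td offsets. The class name TjConcat and the regex are illustrative, and it assumes the strings shown by PDFDebugger contain no escaped or nested parentheses and that the displayed characters correspond to what the font actually maps them to.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
public class TjConcat {
    // Matches a simple literal string followed by Tj, e.g. "(es t) Tj".
    // Assumption: no escaped or nested parentheses inside the string.
    private static final Pattern TJ = Pattern.compile("\\(([^()]*)\\)\\s*Tj");

    // Joins all Tj string operands in order, ignoring positioning operators.
    public static String concatShownText(String contentStream) {
        StringBuilder sb = new StringBuilder();
        Matcher m = TJ.matcher(contentStream);
        while (m.find()) {
            sb.append(m.group(1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The excerpt from the content stream quoted above.
        String excerpt = String.join("\n",
                "(The ) Tj", "22.7679 0 Td",
                "(es t) Tj", "12.2023 0 Td",
                "(ab lis) Tj", "20.7981 0 Td",
                "(h m) Tj", "14.0054 0 Td",
                "(ent ) Tj", "19.1013 0 Td",
                "(of ) Tj", "14.83369 0 Td",
                "(an ) Tj", "16.0359 0 Td",
                "(in te gr) Tj", "25.72701 0 Td",
                "(ate) Tj", "12.80299 0 Td",
                "(d ) Tj");
        // prints "[The es tab lish ment of an in te grated ]"
        System.out.println("[" + concatShownText(excerpt) + "]");
    }
}
```

The result matches the start of the PDFBox output word for word, so the spaces appear to be literal characters in the strings rather than gaps inferred from the Td offsets. If that holds for the real file, it might explain why the tolerance settings make no difference, since those only affect spaces that the stripper inserts itself.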
>> 
>> I am not sure if it is the same problem. I will try to get permission to upload the document somewhere tomorrow.
>> 
>> Best regards,
>> Augusto
>> 
>>> On 31 May 2016, at 18:23, Tilman Hausherr <THausherr@t-online.de> wrote:
>>> 
>>> Please upload the file somewhere. If you've used PDFDebugger before, have a look here:
>>> https://issues.apache.org/jira/browse/PDFBOX-3248
>>> and then check whether your content stream shows the same problem.
>>> 
>>> Tilman
>>> 
>>> Am 31.05.2016 um 15:22 schrieb Augusto Ribeiro Silva:
>>>> Hi all,
>>>> 
>>>> I am using the PDFBox Java library to read the content of some PDFs, and it seems to insert some weird (hyphen-like) spacing. I get the same result using the PDFBox-App command-line utility.
>>>> 
>>>> The es tab lish ment of an in te grated Part ner Re la tion ship Man age ment (PRM) sys tem can po ten tially ad dress sev eral as pets
>>>> 
>>>> I tried extracting text from the same PDF with the pdftotext command-line utility, and it extracts the text correctly:
>>>> The establishment of an integrated Partner Relationship Management (PRM) system can potentially address several aspects
>>>> 
>>>> Does anybody have an idea why PDFBox behaves this way, and any tips for fixing it? I am using Tika, but as I understand it, Tika uses PDFBox for PDF processing underneath.
>>>> 
>>>> Best regards,
>>>> Augusto
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>> 