pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: all spaces between english words is lost after extraction
Date Thu, 21 Dec 2017 10:34:31 GMT
Ignore my last post, I completely forgot what it was really about. I'll 
look at this matter again.

Tilman

Am 21.12.2017 um 10:43 schrieb Tilman Hausherr:
> Thanks, and yes, it is what I mentioned: the pages I looked at don't
> contain space characters. PDF is mostly a graphics format. Spaces are
> not needed; glyphs are simply placed at the correct positions.
>
> Tilman
>
>
>
> Am 21.12.2017 um 02:21 schrieb Dan Liu:
>> Hello all:
>>      I'm using PDFBox 2.0.8. The test PDF file can be downloaded from
>> http://proj.gz-yibo.com:2880/nk7.pdf
>>
>> ------------------
>> With best regards
>> Daniel
>>
>>
>>
>>
>>
>>
>>
>> ------------------ Original ------------------
>> From:  "Tilman Hausherr";<THausherr@t-online.de>;
>> Date:  Wed, Dec 20, 2017 04:43 PM
>> To:  "users"<users@pdfbox.apache.org>;
>>
>> Subject:  Re: all spaces between english words is lost after extraction
>>
>>
>>
>> Hi,
>>
>> Please upload your file to a sharehoster. Also mention what PDFBox
>> version you are using.
>>
>> If the PDF doesn't contain space characters (most PDFs don't), then you
>> won't get any positions for them.
>>
>> High-level PDFBox text extraction (i.e. just getting the text) creates
>> spaces by using heuristics.
>>
>> Tilman
>>
>> Am 20.12.2017 um 03:46 schrieb Dan Liu:
>>> Hello all:
>>>      I extract the text using the code from
>>> https://www.tutorialkart.com/pdfbox/how-to-extract-coordinates-or-position-of-characters-in-pdf/
>>> but all spaces between English words are lost.
>>>
>>> Such as:
>>> "severe acute respiratory syndrome"
>>>
>>> becomes:
>>> severeacuterespiratorysyndrome
>>>
>>> The attachment is the original text.
>>>
>>>
>>> ------------------
>>>
>>> With best regards
>>> Daniel
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>
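The point made in this thread, that a PDF may place glyphs without any
space characters and that spaces then have to be inferred, can be sketched
in plain Java. This is only an illustration of the general idea and not
PDFBox's actual algorithm, which is more involved. The names SpaceHeuristic,
Glyph, GAP_FACTOR, joinWithSpaces and layOut are hypothetical, and the 30%
threshold is an assumption chosen for the demo.

```java
import java.util.ArrayList;
import java.util.List;

final class SpaceHeuristic {

    /** Hypothetical stand-in for a positioned glyph; not a PDFBox class. */
    static final class Glyph {
        final String text;  // character this glyph represents
        final float x;      // left edge on the baseline, in page units
        final float width;  // advance width, in the same units

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    // Assumption: a gap wider than 30% of the previous glyph's width
    // is treated as a word break.
    static final float GAP_FACTOR = 0.3f;

    /** Joins glyphs of one line (sorted left to right), inferring spaces from gaps. */
    static String joinWithSpaces(List<Glyph> line) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : line) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                if (gap > GAP_FACTOR * prev.width) {
                    out.append(' ');
                }
            }
            out.append(g.text);
            prev = g;
        }
        return out.toString();
    }

    /** Demo helper: positions the words' glyphs with no space glyphs at all. */
    static List<Glyph> layOut(String[] words, float glyphWidth, float wordGap) {
        List<Glyph> glyphs = new ArrayList<>();
        float x = 0f;
        for (String word : words) {
            for (int i = 0; i < word.length(); i++) {
                glyphs.add(new Glyph(String.valueOf(word.charAt(i)), x, glyphWidth));
                x += glyphWidth;
            }
            x += wordGap; // horizontal jump instead of a space character
        }
        return glyphs;
    }

    public static void main(String[] args) {
        String[] words = {"severe", "acute", "respiratory", "syndrome"};
        List<Glyph> line = layOut(words, 5f, 3f);

        // Naive concatenation, as when only the glyphs are printed.
        StringBuilder naive = new StringBuilder();
        for (Glyph g : line) {
            naive.append(g.text);
        }
        System.out.println(naive);                // prints: severeacuterespiratorysyndrome
        System.out.println(joinWithSpaces(line)); // prints: severe acute respiratory syndrome
    }
}
```

Code that only prints each glyph with its position, as in the reported
case, behaves like the naive concatenation here: no glyph carries a space,
so none appears unless a gap check of this kind is applied.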

