pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: all spaces between english words is lost after extraction
Date Thu, 21 Dec 2017 09:43:09 GMT
Thanks. And yes, it is what I mentioned: the pages I looked at don't
contain any space characters. PDF is mostly a graphics format. Spaces
are not needed, because each glyph is simply placed at its position.
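
To illustrate the kind of heuristic a text extractor can use to recreate
spaces from glyph positions alone, here is a minimal sketch. It is not
PDFBox's actual implementation. The Glyph class, the GapSpacer class, the
layout helper and the GAP_FACTOR value are all hypothetical names and
numbers chosen for the example, and the glyphs are assumed to be on a
single line and already sorted left to right.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a positioned glyph (not a PDFBox class). */
final class Glyph {
    final String text;   // Unicode text of the glyph
    final double x;      // left edge, in page units
    final double width;  // advance width, in page units

    Glyph(String text, double x, double width) {
        this.text = text;
        this.x = x;
        this.width = width;
    }
}

final class GapSpacer {
    /** Assumed threshold: a gap wider than this fraction of the
     *  average glyph width is treated as a word break. */
    static final double GAP_FACTOR = 0.3;

    /** Joins glyphs of one line, inserting a space wherever the
     *  horizontal gap to the previous glyph exceeds the threshold. */
    static String joinLine(List<Glyph> glyphs) {
        if (glyphs.isEmpty()) {
            return "";
        }
        double total = 0;
        for (Glyph g : glyphs) {
            total += g.width;
        }
        double avgWidth = total / glyphs.size();

        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > GAP_FACTOR * avgWidth) {
                    sb.append(' ');
                }
            }
            sb.append(g.text);
            prev = g;
        }
        return sb.toString();
    }

    /** Demo helper: places the characters of each word side by side and
     *  leaves wordGap units of empty room between words, with no space
     *  glyph in between. */
    static List<Glyph> layout(String[] words, double glyphWidth, double wordGap) {
        List<Glyph> out = new ArrayList<>();
        double x = 0;
        for (String word : words) {
            for (int i = 0; i < word.length(); i++) {
                out.add(new Glyph(String.valueOf(word.charAt(i)), x, glyphWidth));
                x += glyphWidth;
            }
            x += wordGap;
        }
        return out;
    }

    public static void main(String[] args) {
        String[] words = {"severe", "acute", "respiratory", "syndrome"};
        // Naive concatenation of the glyph texts loses the word breaks.
        StringBuilder naive = new StringBuilder();
        for (Glyph g : layout(words, 5.0, 3.0)) {
            naive.append(g.text);
        }
        System.out.println(naive);
        // The gap heuristic recovers them from the positions.
        System.out.println(joinLine(layout(words, 5.0, 3.0)));
    }
}
```

If you print each character position yourself, you get the first kind of
output. The high level text extraction in PDFBox (PDFTextStripper) applies
its own, more elaborate checks of this kind when you just ask for the text.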

Tilman



On 21.12.2017 at 02:21, Dan Liu wrote:
> Hello all:
>      I'm using PDFBox 2.0.8. The test PDF file can be downloaded from http://proj.gz-yibo.com:2880/nk7.pdf
>
> ------------------
> With best regards
> Daniel
>
> ------------------ Original ------------------
> From:  "Tilman Hausherr";<THausherr@t-online.de>;
> Date:  Wed, Dec 20, 2017 04:43 PM
> To:  "users"<users@pdfbox.apache.org>;
>
> Subject:  Re: all spaces between english words is lost after extraction
>
>
>
> Hi,
>
> Please upload your file to a file-sharing site. Also mention which
> PDFBox version you are using.
>
> If the PDF doesn't contain space characters (most PDFs don't), then you
> won't get any positions for them.
>
> High-level PDFBox text extraction (i.e. just getting the text) creates
> spaces by using heuristics.
>
> Tilman
>
> On 20.12.2017 at 03:46, Dan Liu wrote:
>> Hello all:
>>      I extracted the text using the code from
>> https://www.tutorialkart.com/pdfbox/how-to-extract-coordinates-or-position-of-characters-in-pdf/
>> , but all spaces between English words are lost.
>>
>> For example:
>> "severe acute respiratory syndrome"
>>
>> becomes:
>> severeacuterespiratorysyndrome
>>
>> The original text is attached.
>>
>>
>> ------------------
>>
>> With best regards
>> Daniel
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: users-help@pdfbox.apache.org




