pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: all spaces between english words is lost after extraction
Date Thu, 21 Dec 2017 10:44:41 GMT
On 20.12.2017 at 03:46, Dan Liu wrote:
> Such as:
> "severe acute respiratory syndrome"
> becomes:
> severeacuterespiratorysyndrome

I looked at that again... here's what I get with the ExtractText utility:

是卡氏肺囊虫肺炎。从 2002 
重急性呼吸综合征,severe acute respiratory syndrome, 

So that is correct. What you (or that website) did is the mid-level 
thing, i.e. looking at the positions of the glyphs. There you won't find 
the spaces because there aren't any:

So my posting was correct... there are no spaces in the PDF. The spaces 
are added by us.
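
To illustrate what "added by us" means, here is a minimal sketch of the idea in plain Java. It is not PDFBox's actual code, and the real heuristics in the text stripper are more involved. The names Glyph, layOut and joinWithGuessedSpaces, the fixed glyph width, and the gapFactor threshold are all hypothetical, made up for this example. The input contains only positioned glyphs and no space characters; a space is guessed wherever the horizontal gap to the previous glyph is large enough.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a positioned glyph; this is NOT a PDFBox class. */
final class Glyph {
    final String text;
    final double x;      // left edge of the glyph
    final double width;  // advance width of the glyph

    Glyph(String text, double x, double width) {
        this.text = text;
        this.x = x;
        this.width = width;
    }
}

class GapSpaceSketch {

    /**
     * Builds a fake line of glyphs the way such a PDF might store them:
     * no space glyphs at all, only a horizontal jump between words.
     * (Assumption for the demo: every glyph has the same width.)
     */
    static List<Glyph> layOut(String[] words, double glyphWidth, double wordGap) {
        List<Glyph> line = new ArrayList<>();
        double x = 0.0;
        for (String word : words) {
            for (int i = 0; i < word.length(); i++) {
                line.add(new Glyph(String.valueOf(word.charAt(i)), x, glyphWidth));
                x += glyphWidth;
            }
            x += wordGap; // the "space" exists only as a position jump
        }
        return line;
    }

    /**
     * Joins the glyphs of one line and inserts a space whenever the gap
     * to the previous glyph exceeds gapFactor times that glyph's width.
     */
    static String joinWithGuessedSpaces(List<Glyph> line, double gapFactor) {
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : line) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > gapFactor * prev.width) {
                    sb.append(' ');
                }
            }
            sb.append(g.text);
            prev = g;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] words = {"severe", "acute", "respiratory", "syndrome"};
        List<Glyph> line = layOut(words, 5.0, 3.0);

        // Glyphs only, nothing guessed: the words run together.
        StringBuilder raw = new StringBuilder();
        for (Glyph g : line) {
            raw.append(g.text);
        }
        System.out.println(raw); // severeacuterespiratorysyndrome

        // With gap-based guessing the spaces come back.
        System.out.println(joinWithGuessedSpaces(line, 0.5)); // severe acute respiratory syndrome
    }
}
```

Code that reads only the glyphs sees the first result; a text extractor that applies a gap rule of this kind produces the second.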

You can get this colorful output with the DrawPrintTextPositions example 
from the source code.

