pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: all spaces between english words is lost after extraction
Date Thu, 21 Dec 2017 10:44:41 GMT
On 20.12.2017 at 03:46, Dan Liu wrote:
>
> Such as:
> "severe acute respiratory syndrome"
>
> becomes:
> severeacuterespiratorysyndrome


I looked at that again... here's what I get with the ExtractText utility:


化及免疫低下性肺部感染等疾病发病率日渐增多。艾滋病的主要死亡原因为肺部感染,特别
是卡氏肺囊虫肺炎。从 2002 
年底以来,在我国及世界范围内暴发的传染性非典型肺炎(严
重急性呼吸综合征,severe acute respiratory syndrome, 
SARS)疫情,由于多发生于中
青年,其传染性强,病死率高,又缺乏针对性的药物,因而引起了群众的恐慌,同时给国民


So that is correct. What you (or that website) did is the mid-level 
approach, i.e. looking at the positions of the individual glyphs. There 
you won't find the spaces because there aren't any:

[attached image not preserved in the archive]



So my posting was correct... there are no space characters in the PDF 
itself. The spaces in the extracted text are added by us.
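
To illustrate the idea of spaces being added during extraction, here is a minimal sketch in plain Java. It is not PDFBox code: the Glyph class, the join methods and the gapFactor threshold are all hypothetical names made up for this example, and the real text stripper's heuristics are more involved. It only shows how a space can be inferred from the horizontal gap between two glyphs even though no space glyph exists.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these are PDFBox classes.
final class GapSpacingSketch {

    // A glyph as a mid-level extractor might report it: its text plus its
    // horizontal start position and width, in the same arbitrary units.
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    // Plain concatenation: what you get from the glyphs alone.
    static String joinRaw(List<Glyph> line) {
        StringBuilder out = new StringBuilder();
        for (Glyph g : line) {
            out.append(g.text);
        }
        return out.toString();
    }

    // Concatenates the glyphs of one line, inserting a space wherever the gap
    // to the previous glyph exceeds gapFactor times the previous glyph's width.
    // gapFactor is an assumed tuning knob for this sketch.
    static String join(List<Glyph> line, float gapFactor) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : line) {
            if (prev != null) {
                float gap = g.x - (prev.x + prev.width);
                if (gap > gapFactor * prev.width) {
                    out.append(' ');
                }
            }
            out.append(g.text);
            prev = g;
        }
        return out.toString();
    }
}
```

With four glyphs "a", "b", "c", "d" where only "b" and "c" are separated by a visible gap, joinRaw returns the run-together text while join puts a space at the gap. That is the difference between reading glyph positions directly and using the high-level text extraction.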

You can produce this kind of colored output with the 
DrawPrintTextPositions example from the PDFBox source code.

Tilman


