pdfbox-users mailing list archives

From "Dan Liu" <139250...@qq.com>
Subject Re: 1 text line becomes 2 line after extraction
Date Thu, 21 Dec 2017 01:26:20 GMT
Hi, all:
    I'm using PDFBox 2.0.8. The test PDF file can be downloaded from http://proj.gz-yibo.com:2880/nk7.pdf

For example, a text line on page 19:
7.放射性核素扫描应用133 氙或99m 锝-二乙三胺五乙酸(99mTc-DTPA)雾化吸人。99m
锝
becomes:
133 99m 99m 99m
7.放射性核素扫描应用 氙或 锝-二乙三胺五乙酸(Tc-DTPA)雾化吸人。
锝


------------------
  With best regards


Daniel


------------------ Original ------------------
From:  "139250065";<139250065@qq.com>;
Date:  Wed, Dec 20, 2017 10:39 AM
To:  "users"<users@pdfbox.apache.org>;

Subject:  1 text line becomes 2 line after extraction



For example, this single line:
肺具有广泛的呼吸面积,成人的总呼吸面积约有100m2(3 亿-7.5 亿肺泡),在呼吸过程中,

becomes 2 lines after extraction:
2
肺具有广泛的呼吸面积,成人的总呼吸面积约有100m(3 亿-7.5 亿肺泡),在呼吸过程中,

This happens because the y coordinate of the character '2' is smaller than that of the other characters on the line.
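
To illustrate the effect, here is a minimal sketch of line grouping with a height-based tolerance. It is not PDFBox code and not how PDFTextStripper works internally: `LineGrouper`, `Glyph`, and the `factor` parameter are hypothetical names made up for this example. In a real PDFBox program, the positions would presumably come from `TextPosition` (for instance `getXDirAdj()`, `getYDirAdj()`, `getHeightDir()`) inside a `PDFTextStripper` subclass that overrides `writeString`. The idea is that a glyph joins the current line when its y offset is within a fraction of the tallest glyph height seen on that line, so a small superscript does not start a line of its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class LineGrouper {
    /** Hypothetical stand-in for a positioned piece of text (not a PDFBox class). */
    static final class Glyph {
        final String text;
        final double x, y, height;

        Glyph(String text, double x, double y, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.height = height;
        }
    }

    /**
     * Groups glyphs into lines. A glyph stays on the current line when its
     * y offset from the line's reference glyph is at most
     * factor * max(reference height, glyph height).
     */
    static List<String> groupIntoLines(List<Glyph> glyphs, double factor) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y));

        List<String> lines = new ArrayList<>();
        List<Glyph> current = new ArrayList<>();
        double refY = 0;
        double refHeight = 0;
        for (Glyph g : sorted) {
            boolean sameLine = !current.isEmpty()
                    && Math.abs(g.y - refY) <= factor * Math.max(refHeight, g.height);
            if (!sameLine) {
                // Flush the finished line and start a new one at this glyph.
                if (!current.isEmpty()) {
                    lines.add(join(current));
                }
                current = new ArrayList<>();
                refY = g.y;
                refHeight = g.height;
            } else if (g.height > refHeight) {
                // Let the tallest glyph define the line's reference position.
                refY = g.y;
                refHeight = g.height;
            }
            current.add(g);
        }
        if (!current.isEmpty()) {
            lines.add(join(current));
        }
        return lines;
    }

    /** Orders one line's glyphs left to right and concatenates their text. */
    private static String join(List<Glyph> line) {
        line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : line) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up coordinates: the superscript "2" sits 5 units above its neighbours.
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("100m", 10, 100, 10),
                new Glyph("2", 40, 95, 6),
                new Glyph("(3", 46, 100, 10),
                new Glyph("next", 10, 115, 10));
        for (String line : groupIntoLines(glyphs, 0.6)) {
            System.out.println(line);
        }
    }
}
```

With a very tight tolerance (for example a factor of 0.1), the same sketch puts the '2' on a line of its own, which resembles the output reported above. The coordinates and the 0.6 factor are illustrative assumptions, not values taken from the test PDF.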


------------------


with best regards


daniel