pdfbox-users mailing list archives

From "psconceicao@outlook.com" <psconcei...@outlook.com>
Subject Identify not visible characters - Overlapped characters
Date Tue, 27 Dec 2016 23:52:43 GMT
Hello everyone,

I am using PDFBox 1.8.12 (because I'm developing in C#), and I can extract every character from
a PDF together with its position.

My objective is to perform a layout analysis and try to reproduce the PDF layout in a text file.
However, I'm facing a huge problem: identifying characters that are not visible.

In the attached file, the text "Alandroal (Nossa Senhora da Conceic..." occupies part of the space
used by the word "Rural" (row 5), but it is not visible there.

I would like someone to help me find a way to identify the text that is not visible, so that I can
leave those characters out of the text file.

This approach: http://stackoverflow.com/questions/19809813/how-to-check-if-a-text-is-transparent-with-pdfbox
does not work for the attached file (it only works with images).
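
To make the problem concrete, here is a minimal Java sketch of a purely geometric check that flags pairs of characters whose bounding boxes overlap. It is only an illustration under assumptions: Glyph and OverlapSketch are hypothetical names, not PDFBox classes, and the box values are assumed to come from the extracted character positions (for example the x, y, width and height reported for each character). It does not decide which character of a pair is the hidden one; that would probably need extra information such as clipping or drawing order, which this sketch does not look at.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical container for one extracted character and its bounding box.
// Not a PDFBox class; the coordinate convention (top-left origin) is an assumption.
class Glyph {
    final String text;
    final float x, y, width, height;

    Glyph(String text, float x, float y, float width, float height) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    // Area of the intersection of the two boxes, or 0 if they do not intersect.
    float overlapArea(Glyph other) {
        float w = Math.min(x + width, other.x + other.width) - Math.max(x, other.x);
        float h = Math.min(y + height, other.y + other.height) - Math.max(y, other.y);
        return (w > 0 && h > 0) ? w * h : 0f;
    }
}

class OverlapSketch {
    // Returns index pairs (i, j), i < j, whose intersection covers at least
    // minFraction of the smaller of the two boxes. These are only candidates
    // for "one of the two is probably not visible".
    static List<int[]> overlappingPairs(List<Glyph> glyphs, float minFraction) {
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < glyphs.size(); i++) {
            for (int j = i + 1; j < glyphs.size(); j++) {
                Glyph a = glyphs.get(i);
                Glyph b = glyphs.get(j);
                float smaller = Math.min(a.width * a.height, b.width * b.height);
                if (smaller > 0 && a.overlapArea(b) / smaller >= minFraction) {
                    pairs.add(new int[] {i, j});
                }
            }
        }
        return pairs;
    }
}
```

Calling OverlapSketch.overlappingPairs(glyphs, 0.5f) on the extracted characters would list the pairs that collide, such as the characters of "Rural" and the characters drawn on top of them. The 0.5 threshold is an arbitrary starting value, and the pairwise loop is quadratic, so a real page may need the characters grouped by row first.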

Many thanks in advance,
Paulo Sergio
