pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Identify not visible characters - Overlapped characters
Date Wed, 28 Dec 2016 11:18:38 GMT
On 28.12.2016 at 00:52, psconceicao@outlook.com wrote:
> Hello everyone,
> I am using PDFBox 1.8.12 (because I’m developing in C#) and I can 
> extract all characters from a PDF with the respective position.
> My objective is to perform a layout analysis and try to reproduce the 
> PDF layout in a text file.
> However, I’m facing a huge problem: identifying characters that are not visible.
> In the attached file, the text “Alandroal (Nossa Senhora da Conceic…” 
> runs into the space used by the word “Rural” (row 5), but that part is not visible.

Ooohhhh... your file shows one interesting effect, which may or may not 
be a bug: text extraction returns more text than is rendered. "Alandroal 
(Nossa Senhora da Conceicao)" is extracted in full, but in rendering we 
only see "Alandroal (Nossa Senhora da Co" due to clipping.

This may need a change deep inside PDFBox itself, i.e. a check of whether a 
glyph is inside the clipping region or not. For that, you'd need to have a 
look at PageDrawer.java and copy all the clipping operations to the text 
stripper (or extend the text stripper). I'd recommend doing this with the 
2.0 version, to avoid a lot of work moving from 1.8 to 2.0 at a later 
time.
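
To illustrate the kind of check meant here, below is a minimal sketch that uses only java.awt.geom, not PDFBox. It assumes you already have each glyph's bounding box and the current clipping region in the same coordinate space. The class ClipFilter, its methods, and the evenly spaced glyph boxes are hypothetical and only for illustration. A glyph counts as visible if the centre of its box lies inside the clip, which is a simplification. In PDFBox 2.0 the clip should be available from the graphics state (I believe via getCurrentClippingPath(), but please verify against the Javadoc).

```java
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;

public class ClipFilter {

    // Simplified visibility test: the glyph is kept if the centre of its
    // bounding box lies inside the clipping region. Partially clipped
    // glyphs are therefore either fully kept or fully dropped.
    static boolean isVisible(Rectangle2D glyphBox, Area clip) {
        return clip.contains(glyphBox.getCenterX(), glyphBox.getCenterY());
    }

    // Hypothetical helper: lays the characters of `text` out left to right
    // as equal-width boxes starting at startX and returns only those that
    // pass the visibility test. Real glyph boxes would come from the text
    // extraction, not from this fixed-width layout.
    static String visibleText(String text, double startX, double glyphWidth,
                              double y, double height, Area clip) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            Rectangle2D box = new Rectangle2D.Double(
                    startX + i * glyphWidth, y, glyphWidth, height);
            if (isVisible(box, clip)) {
                sb.append(text.charAt(i));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Clip covers x in [0, 70); glyphs are 10 units wide starting at x = 50.
        Area clip = new Area(new Rectangle2D.Double(0, 0, 70, 20));
        System.out.println(visibleText("Conceicao", 50, 10, 0, 20, clip)); // prints "Co"
    }
}
```

In a real text stripper the same test would run once per glyph, with the clip updated whenever the content stream changes it, which is the part that has to be copied from PageDrawer.java.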

Also try Tabula (https://github.com/tabulapdf/); I wonder how they handle this.


> I would like someone to help me find a way to identify the text that is not 
> visible, so that I can leave those characters out of the text file.
> This approach: 
> http://stackoverflow.com/questions/19809813/how-to-check-if-a-text-is-transparent-with-pdfbox
> doesn’t work for the attached file (it only works with images).
> Many thanks in advance,
> Paulo Sergio
