pdfbox-users mailing list archives

From "psconceicao@outlook.com" <psconcei...@outlook.com>
Subject RE: Identify not visible characters - Overlapped characters
Date Wed, 28 Dec 2016 12:52:28 GMT
Thank you.

Unfortunately, Tabula uses a totally different approach (image analysis)
that works well for the problem described but makes many errors in other
cases. Tabula also produces inconsistent results when a table spans more
than one page; for that reason it is far from being a good tool for text
extraction with correct positioning.

I'm going to study the clipping operators and try to solve this kind of
problem.

The type of file shown is very common when spreadsheets are printed to PDF,
especially where the content of one cell is wider than what is shown on
screen because the next cell also has content. For that reason, I really
have to solve this kind of problem.

Tilman, do you know of any piece of code that could help a little?

Many thanks!


-----Original message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de]
Sent: Wednesday, 28 December 2016 11:19
To: users@pdfbox.apache.org
Subject: Re: Identify not visible characters - Overlapped characters

On 28.12.2016 at 00:52, psconceicao@outlook.com wrote:
> Hello everyone,
> I am using PDFBox 1.8.12 (because I'm developing in C#) and I can 
> extract all characters from a PDF with the respective position.
> My objective is to perform a layout analysis and try to reproduce the 
> PDF layout in a text file.
> However, I'm facing a huge problem: identifying characters that are not
> visible. In the attached file, the text "Alandroal (Nossa Senhora da
> Conceic." runs into the space used by the word "Rural" (row 5), but that
> part is not visible.

Ooohhhh... your file shows one interesting effect, which may or may not be a
bug: text extraction shows more data than in rendering. "Alandroal (Nossa
Senhora da Conceicao)" is extracted in full, but in rendering we only see
"Alandroal (Nossa Senhora da Co" due to clipping.

This may need a change deep within PDFBox itself, i.e. a check of whether a
glyph is in the clipping region or not. For that, you'd need to have a look
at PageDrawer.java and copy all clipping operations to the text stripper (or
extend the text stripper). I'd recommend doing this with the 2.0 version, to
avoid a lot of work moving from 1.8 to 2.0 at a later time.
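
The glyph-versus-clip check itself can be sketched independently of PDFBox.
The following is only a minimal illustration, assuming the current clipping
path has already been captured as a java.awt.geom.Area and that each
extracted character comes with a bounding box in the same coordinate space.
The Glyph holder, the method names and the visibility threshold are all
hypothetical and are not part of the PDFBox API.

```java
import java.awt.geom.Area;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

final class ClipVisibility {

    /** Hypothetical holder for one extracted character and its bounding box. */
    static final class Glyph {
        final String text;
        final Rectangle2D box;

        Glyph(String text, Rectangle2D box) {
            this.text = text;
            this.box = box;
        }
    }

    /**
     * Approximate fraction (0.0 to 1.0) of the glyph box inside the clip.
     * The bounding box of the intersection is used as the visible area,
     * which is exact for rectangular clips (the spreadsheet-cell case)
     * and only an upper bound for arbitrary clip shapes.
     */
    static double visibleFraction(Area clip, Rectangle2D box) {
        double total = box.getWidth() * box.getHeight();
        if (total <= 0) {
            return 0.0;
        }
        Area visible = new Area(box);
        visible.intersect(clip);
        if (visible.isEmpty()) {
            return 0.0;
        }
        Rectangle2D b = visible.getBounds2D();
        return (b.getWidth() * b.getHeight()) / total;
    }

    /** Keeps only the glyphs whose visible fraction reaches minFraction. */
    static String visibleText(Area clip, List<Glyph> glyphs, double minFraction) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            if (visibleFraction(clip, g.box) >= minFraction) {
                sb.append(g.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A 50 x 12 "cell" acting as the clip, and four 10 x 10 glyph boxes:
        // two fully inside, one half inside, one fully outside.
        Area clip = new Area(new Rectangle2D.Double(0, 0, 50, 12));
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph("C", new Rectangle2D.Double(0, 0, 10, 10)));
        glyphs.add(new Glyph("o", new Rectangle2D.Double(20, 0, 10, 10)));
        glyphs.add(new Glyph("n", new Rectangle2D.Double(45, 0, 10, 10)));
        glyphs.add(new Glyph("c", new Rectangle2D.Double(60, 0, 10, 10)));
        System.out.println(visibleText(clip, glyphs, 0.4));
    }
}
```

The threshold is a design choice: a low value keeps characters that are cut
in half by the cell border, a high value drops them. The harder part, which
this sketch does not cover, is tracking the clipping path while the text
stripper processes the content stream.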

Also try https://github.com/tabulapdf/ ; I wonder how they handle this.


> I would like someone to help me find a way to identify the text that is
> not visible, so that I can leave those characters out of the text file.
> This approach:
> http://stackoverflow.com/questions/19809813/how-to-check-if-a-text-is-transparent-with-pdfbox
> doesn't work for the attached file (it only works with images).
> Many thanks in advance,
> Paulo Sergio
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
