pdfbox-users mailing list archives

From "psconceicao@outlook.com" <psconcei...@outlook.com>
Subject RE: Identify not visible characters - Overlapped characters
Date Wed, 28 Dec 2016 21:29:54 GMT
Hi Manuel,

I'm sorry for my mistake and many thanks for your help and attention.

The best tool I know of for extracting text from a PDF while maintaining the correct
layout (I have not tested Monarch) is part of a CAAT package: Caseware IDEA. However, this
software is very expensive and does a lot of other things.

All the other tools that I tested (and I tested several) get the positioning analysis wrong.

It would be good to develop a tool that produces results similar to those obtained with IDEA.

The work that you have done can help others achieve that result.

-----Original Message-----
From: Manuel Aristarán [mailto:jazzido@jazzido.com] On Behalf Of Manuel Aristarán
Sent: Wednesday, December 28, 2016 20:37
To: users@pdfbox.apache.org
Subject: Re: Identify not visible characters - Overlapped characters

Hi Paulo,

> On Dec 28, 2016, at 9:52 AM, psconceicao@outlook.com wrote:
> Unfortunately, Tabula uses a totally different approach (image 
> analysis) [...]

Sorry for going (sort of) off-topic, but that's not correct. In fact, Tabula does not support
images. Thanks to PDFBox, it "mines" text and graphical elements, and uses a set of heuristics
that attempt to reconstruct a tabular structure.
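
To illustrate what a heuristic of this kind can look like, here is a minimal sketch. It is
not Tabula's actual code and does not call PDFBox: it assumes the text chunks and their
coordinates have already been extracted (for example from PDFBox's TextPosition objects),
with y increasing down the page. The Chunk class, the groupIntoRows method, and the
tolerance value are all hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: not Tabula's code, and independent of the PDFBox API.
class RowGrouping {

    // A piece of text plus the position where it starts (hypothetical type).
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Groups chunks into rows. A chunk joins the current row when its y is
    // within `tolerance` of the first chunk of that row; otherwise it starts
    // a new row. Each row is then ordered left to right by x.
    static List<List<String>> groupIntoRows(List<Chunk> chunks, double tolerance) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y));

        List<List<Chunk>> rows = new ArrayList<>();
        for (Chunk c : sorted) {
            List<Chunk> last = rows.isEmpty() ? null : rows.get(rows.size() - 1);
            if (last != null && Math.abs(c.y - last.get(0).y) <= tolerance) {
                last.add(c);
            } else {
                List<Chunk> row = new ArrayList<>();
                row.add(c);
                rows.add(row);
            }
        }

        List<List<String>> result = new ArrayList<>();
        for (List<Chunk> row : rows) {
            row.sort(Comparator.comparingDouble((Chunk c) -> c.x));
            List<String> cells = new ArrayList<>();
            for (Chunk c : row) {
                cells.add(c.text);
            }
            result.add(cells);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up coordinates: two rows whose baselines differ slightly.
        List<Chunk> chunks = List.of(
            new Chunk("Total", 50, 120.3),
            new Chunk("Name", 50, 100),
            new Chunk("42", 200, 120),
            new Chunk("Amount", 200, 100.4));
        System.out.println(groupIntoRows(chunks, 2.0));
    }
}
```

A real table detector has to do much more (column boundaries, ruling lines, tables that
span pages), so this only shows the row-clustering step under the assumptions above.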

> Tabula also does incoherent analysis when a table is larger than one 
> page; for that reason Tabula is far from being a good tool for text 
> extraction with correct positioning.

We always welcome bug reports (and patches!) :) [1]


[1] https://github.com/tabulapdf/tabula-java/issues

Manuel Aristarán <manuel@jazzido.com>
