pdfbox-users mailing list archives

From Manuel Aristarán <man...@jazzido.com>
Subject Re: Identify not visible characters - Overlapped characters
Date Wed, 28 Dec 2016 20:36:54 GMT
Hi Paulo,

> On Dec 28, 2016, at 9:52 AM, psconceicao@outlook.com wrote:
> 
> Unfortunately, Tabula uses a totally different approach (image analysis)
> [...]

Sorry for going (sort of) off-topic, but that's not correct. In fact, Tabula does not support
images. Thanks to PDFBox, it "mines" text and graphical elements, and uses a set of heuristics
that attempt to reconstruct a tabular structure.

> Tabula also do incoherent analysis when a table is larger than one page, for
> that reason Tabula is far from being a good tool for text extraction with
> correct positioning.

We always welcome bug reports (and patches!) :) [1]

Thanks!

[1] https://github.com/tabulapdf/tabula-java/issues


—
Manuel Aristarán <manuel@jazzido.com>
http://jazzido.com





