pdfbox-users mailing list archives

From John Logan <John.Lo...@texture.com>
Subject Re: Identify not visible characters - Overlapped characters
Date Thu, 29 Dec 2016 06:22:19 GMT
Hi Paulo,

Is your layout analysis focused on extracting tabular data (records) from a PDF file?  Or
are you trying to handle more general layouts?

PDFBOX-2998 contains detailed discussion about enhancing the extraction algorithms, including
adding advanced layout analysis.  The argument against this is that it's very hard to simultaneously
achieve high quality and general applicability.

The current text extractor allows a developer to override the text output methods, but the
core is fairly monolithic.  It'd be nice to rework the text extraction so that the process
is more modular, and so that alternate processes can include components from externally developed
classes and libraries.  This way, PDFBox doesn't need to solve the general layout analysis
problem, but it would be easier to develop extensions that solve specific problems well.

For what it's worth, the way I currently approach it is to define a PdfTextFeatureExtractor
that extends PDFStreamEngine.  In particular, the new class overrides the showGlyph() method
to write a YAML file that contains detailed information for each rendered glyph.
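
As a rough illustration of the per-glyph output described above, here is a minimal sketch of a record type that serializes itself as one YAML list entry. The class name GlyphRecord, its fields, and the YAML keys are all assumptions made for this example, not the actual format used. The wiring into an overridden showGlyph() is left out because it needs PDFBox on the classpath and the method's signature differs between PDFBox versions.

```java
import java.util.Locale;

// Hypothetical per-glyph record; field names and YAML keys are illustrative only.
final class GlyphRecord {
    final int page;
    final String text;      // Unicode text of the glyph
    final String fontName;
    final float x, y, width, height; // glyph box in page coordinates

    GlyphRecord(int page, String text, String fontName,
                float x, float y, float width, float height) {
        this.page = page;
        this.text = text;
        this.fontName = fontName;
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    // Renders the record as one entry of a YAML sequence.
    String toYaml() {
        return String.format(Locale.ROOT,
            "- page: %d\n  text: %s\n  font: %s\n"
            + "  x: %.2f\n  y: %.2f\n  width: %.2f\n  height: %.2f\n",
            page, quote(text), quote(fontName), x, y, width, height);
    }

    // Produces a YAML double-quoted scalar, escaping quotes, backslashes
    // and ASCII control characters.
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20 || c == 0x7f) {
                sb.append(String.format("\\x%02x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.append('"').toString();
    }
}
```

An extractor built this way would create one GlyphRecord per showGlyph() call and append the result of toYaml() to the output file, so later analysis stages only need a YAML parser and no PDF library.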

From there one can develop whatever one wants for layout extraction and all of the other
segmentation and classification tasks.  The core layout analysis techniques I chose for my
work are based on the paper "Two Geometric Algorithms for Layout Analysis", by Thomas Breuel.
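
To give a flavor of that kind of technique, below is a minimal sketch of a branch-and-bound search for the largest empty rectangle on a page, in the spirit of the whitespace analysis described in Breuel's paper. It is a simplified illustration, not the implementation referred to above: the class names Rect and WhitespaceFinder are made up for this example, quality is plain area, and the pivot is simply the obstacle whose center is nearest the center of the current bounds.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Axis-aligned rectangle; (x0, y0) is one corner and (x1, y1) the opposite one.
final class Rect {
    final double x0, y0, x1, y1;

    Rect(double x0, double y0, double x1, double y1) {
        this.x0 = x0; this.y0 = y0; this.x1 = x1; this.y1 = y1;
    }

    double area() {
        return Math.max(0, x1 - x0) * Math.max(0, y1 - y0);
    }

    // True when the interiors intersect; touching edges do not count.
    boolean overlaps(Rect o) {
        return x0 < o.x1 && o.x0 < x1 && y0 < o.y1 && o.y0 < y1;
    }
}

final class WhitespaceFinder {
    // A search state: a candidate region plus the obstacles that overlap it.
    private static final class Node {
        final Rect bounds;
        final List<Rect> obstacles;
        Node(Rect bounds, List<Rect> all) {
            this.bounds = bounds;
            this.obstacles = new ArrayList<>();
            for (Rect r : all) {
                if (r.overlaps(bounds)) this.obstacles.add(r);
            }
        }
    }

    // Returns the largest-area rectangle inside page that overlaps no obstacle,
    // or null if no rectangle with positive area exists.
    static Rect largestEmptyRect(Rect page, List<Rect> obstacles) {
        PriorityQueue<Node> queue = new PriorityQueue<>(
            Comparator.comparingDouble((Node n) -> n.bounds.area()).reversed());
        queue.add(new Node(page, obstacles));
        while (!queue.isEmpty()) {
            Node n = queue.poll();
            Rect b = n.bounds;
            // Children never have a larger area than their parent, so the first
            // obstacle-free region taken from the queue is the largest one.
            if (n.obstacles.isEmpty()) return b;
            Rect p = pickPivot(n);
            Rect[] parts = {
                new Rect(p.x1, b.y0, b.x1, b.y1), // beyond the pivot's x1 edge
                new Rect(b.x0, b.y0, p.x0, b.y1), // before the pivot's x0 edge
                new Rect(b.x0, p.y1, b.x1, b.y1), // beyond the pivot's y1 edge
                new Rect(b.x0, b.y0, b.x1, p.y0)  // before the pivot's y0 edge
            };
            for (Rect part : parts) {
                if (part.area() > 0) queue.add(new Node(part, n.obstacles));
            }
        }
        return null;
    }

    // Picks the obstacle whose center is closest to the center of the bounds.
    private static Rect pickPivot(Node n) {
        double cx = (n.bounds.x0 + n.bounds.x1) / 2;
        double cy = (n.bounds.y0 + n.bounds.y1) / 2;
        Rect best = null;
        double bestDist = Double.MAX_VALUE;
        for (Rect r : n.obstacles) {
            double dx = (r.x0 + r.x1) / 2 - cx;
            double dy = (r.y0 + r.y1) / 2 - cy;
            double d = dx * dx + dy * dy;
            if (d < bestDist) { bestDist = d; best = r; }
        }
        return best;
    }
}
```

Given the bounding boxes of two text columns as obstacles, the search returns the gutter between them when that gutter is larger than any margin. Finding several whitespace rectangles would mean adding each result to the obstacle list and searching again.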

Best regards,


From: psconceicao@outlook.com <psconceicao@outlook.com>
Sent: Wednesday, December 28, 2016 1:29:54 PM
To: users@pdfbox.apache.org
Subject: RE: Identify not visible characters - Overlapped characters

Hi Manuel,

I'm sorry for my mistake and many thanks for your help and attention.

The best tool that I know of for extracting text from a PDF while maintaining the correct
layout (I didn't test Monarch) is part of a CAAT package: Caseware IDEA. However, this software
is very expensive and does a lot of other things.

All the other tools that I tested (and I tested several) get the positioning analysis wrong.

It would be good to develop a tool that produces results similar to those obtained with IDEA.

The work that you developed can help others to achieve that result.

-----Original Message-----
From: Manuel Aristarán [mailto:jazzido@jazzido.com] On behalf of Manuel Aristarán
Sent: Wednesday, December 28, 2016 20:37
To: users@pdfbox.apache.org
Subject: Re: Identify not visible characters - Overlapped characters

Hi Paulo,

> On Dec 28, 2016, at 9:52 AM, psconceicao@outlook.com wrote:
> Unfortunately, Tabula uses a totally different approach (image
> analysis) [...]

Sorry for going (sort of) off-topic, but that's not correct. In fact, Tabula does not support
images. Thanks to PDFBox, it "mines" text and graphical elements, and uses a set of heuristics
that attempt to reconstruct a tabular structure.

> Tabula also do incoherent analysis when a table is larger than one
> page, for that reason Tabula is far from being a good tool for text
> extraction with correct positioning.

We always welcome bug reports (and patches!) :) [1]


[1] https://github.com/tabulapdf/tabula-java/issues

Manuel Aristarán <manuel@jazzido.com>
