pdfbox-users mailing list archives

From Peter Murray-Rust <pm...@cam.ac.uk>
Subject Re: Identify not visible characters - Overlapped characters
Date Thu, 29 Dec 2016 09:26:03 GMT
Over the years we have developed PDF2SVG (
https://bitbucket.org/petermr/pdf2svg/overview), which is built on PDFBox
1.8 and uses PageDrawer to capture the primitives (characters, paths,
images). It tries to carry out a faithful extraction without semantic loss,
and we have put effort into capturing styles and font weights and into
translating non-standard characters to Unicode. This is particularly
important for high Unicode code points such as mathematical symbols. There
is also emphasis on downstream analysis (e.g. converting paths to SVG
primitives such as circles and rects). The main downstream focus is on
scientific and technical documents and on translating their complete
contents (text, diagrams, images, tables, etc.) to semantic form.
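
To illustrate the kind of path-to-primitive conversion mentioned above, here
is a minimal sketch in Java. It is not PDF2SVG's actual code: the class name
PathToSvgRect, the method toSvgRect and the tolerance parameter eps are
hypothetical, and it only handles the simplest case, a closed path of four
corners that forms an axis-aligned rectangle. It also ignores the flip
between PDF and SVG y-axes.

```java
import java.util.Arrays;
import java.util.Locale;

/** Hypothetical helper, not part of PDF2SVG or PDFBox. */
final class PathToSvgRect {

    /**
     * Returns an SVG rect element if the four corners (x[i], y[i]), taken in
     * drawing order, form an axis-aligned rectangle; otherwise returns null.
     * eps is the tolerance used when comparing coordinates.
     */
    static String toSvgRect(double[] x, double[] y, double eps) {
        if (x.length != 4 || y.length != 4) {
            return null;
        }
        // Edges must alternate between horizontal and vertical.
        boolean firstHorizontal = Math.abs(y[0] - y[1]) <= eps;
        for (int i = 0; i < 4; i++) {
            int j = (i + 1) % 4;
            double dx = Math.abs(x[i] - x[j]);
            double dy = Math.abs(y[i] - y[j]);
            boolean expectHorizontal = (i % 2 == 0) == firstHorizontal;
            boolean ok = expectHorizontal
                    ? (dy <= eps && dx > eps)   // flat edge with real length
                    : (dx <= eps && dy > eps);  // upright edge with real length
            if (!ok) {
                return null;
            }
        }
        double minX = Arrays.stream(x).min().getAsDouble();
        double maxX = Arrays.stream(x).max().getAsDouble();
        double minY = Arrays.stream(y).min().getAsDouble();
        double maxY = Arrays.stream(y).max().getAsDouble();
        return String.format(Locale.ROOT,
                "<rect x=\"%.2f\" y=\"%.2f\" width=\"%.2f\" height=\"%.2f\"/>",
                minX, minY, maxX - minX, maxY - minY);
    }
}
```

A path with corners (0,0), (10,0), (10,5), (0,5) is reported as a rect of
width 10 and height 5, while a diamond-shaped path returns null and would be
left as a generic SVG path.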

You are welcome to try it and see whether it helps your problem. The
clipping paths are initially preserved, I think, but are not output to the
final SVG.

We haven't needed to change it for a year or so, but have wondered about
converting to PDFBox 2.0. Unfortunately, as Manuel (hi!!) says, this requires
considerable rewriting (and I'd be interested in knowing whether anyone has
written the PDFBox 2.0 equivalent of capturing the PageDrawer output).

We are now being asked to process some documents in bulk and can process a
subset of tables, but we don't intend to duplicate Tabula completely.


On Thu, Dec 29, 2016 at 6:22 AM, John Logan <John.Logan@texture.com> wrote:

> Hi Paulo,
> Is your layout analysis focused on extracting tabular data (records) from
> a PDF file?  Or are you trying to handle more general layouts?
> PDFBOX-2998 contains detailed discussion about enhancing the extraction
> algorithms, including adding advanced layout analysis.  The argument
> against this is that it's very hard to simultaneously achieve high quality
> and general applicability.
> The current text extractor allows a developer to override the text output
> methods, but the core is fairly monolithic.  It'd be nice to rework the
> text extraction so that the process was more modular, and so that alternate
> processes could include components from externally-developed classes and
> libraries.  This way, PDFBox doesn't need to solve the general layout
> analysis problem, but it would be easier to develop extensions that solve
> specific problems well.
> For what it's worth, the way I currently approach it is to define a
> PdfTextFeatureExtractor that extends PDFStreamEngine.  In particular, the
> new class overrides the showGlyph() method to write a YAML file that
> contains detailed information for each rendered glyph.
> From there one can develop whatever one wants for layout extraction and
> all of the other segmentation and classification tasks.  The core layout
> analysis techniques I chose for my work are based on the paper "Two
> Geometric Algorithms for Layout Analysis", by Thomas Breuel.
> Best regards,
> John
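
A small sketch of the per-glyph record John describes, written without any
PDFBox dependency so it can be run on its own. The class name GlyphRecord,
its fields and the YAML keys are assumptions made for illustration; the
thread does not say what John's YAML file actually contains. In a real
extractor the values would be filled in from the overridden showGlyph()
method rather than passed in by hand.

```java
import java.util.Locale;

/** Hypothetical per-glyph record; field names are illustrative only. */
final class GlyphRecord {
    final String text;      // Unicode text of the glyph
    final String fontName;  // name of the font used to draw it
    final double x, y;      // position of the glyph on the page
    final double width, height;

    GlyphRecord(String text, String fontName,
                double x, double y, double width, double height) {
        this.text = text;
        this.fontName = fontName;
        this.x = x;
        this.y = y;
        this.width = width;
        this.height = height;
    }

    /** Escapes backslashes and double quotes for a YAML double-quoted scalar. */
    private static String escape(String s) {
        // Control characters are not handled in this sketch.
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Renders the record as one item of a YAML sequence. */
    String toYaml() {
        return String.format(Locale.ROOT,
                "- text: \"%s\"\n"
              + "  font: \"%s\"\n"
              + "  x: %.2f\n"
              + "  y: %.2f\n"
              + "  width: %.2f\n"
              + "  height: %.2f\n",
                escape(text), escape(fontName), x, y, width, height);
    }
}
```

Appending toYaml() for every glyph gives one YAML sequence per page or
document, which downstream layout code can then read back without touching
the PDF again.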
> ________________________________
> From: psconceicao@outlook.com <psconceicao@outlook.com>
> Sent: Wednesday, December 28, 2016 1:29:54 PM
> To: users@pdfbox.apache.org
> Subject: RE: Identify not visible characters - Overlapped characters
> Hi Manuel,
> I'm sorry for my mistake and many thanks for your help and attention.
> The best tool that I know for extracting text from a PDF while maintaining
> the correct layout (I didn't test Monarch) is inside a CAAT software
> package: Caseware IDEA. However, this software is very expensive and does a
> lot of other things.
> All the other tools that I tested (and I tested several) get the
> positioning analysis wrong.
> It will be good to develop a tool to produce similar results obtained with
> The work that you developed can help others to achieve that result.
> Paulo
> -----Original Message-----
> From: Manuel Aristarán [mailto:jazzido@jazzido.com] On behalf of Manuel
> Aristarán
> Sent: Wednesday, December 28, 2016 20:37
> To: users@pdfbox.apache.org
> Subject: Re: Identify not visible characters - Overlapped characters
> Hi Paulo,
> > On Dec 28, 2016, at 9:52 AM, psconceicao@outlook.com wrote:
> >
> > Unfortunately, Tabula uses a totally different approach (image
> > analysis) [...]
> Sorry for going (sort of) off-topic, but that's not correct. In fact,
> Tabula does not support images. Thanks to PDFBox, it "mines" text and
> graphical elements, and uses a set of heuristics that attempt to
> reconstruct a tabular structure.
> > Tabula also does incoherent analysis when a table is larger than one
> > page, for that reason Tabula is far from being a good tool for text
> > extraction with correct positioning.
> We always welcome bug reports (and patches!) :) [1]
> Thanks!
> [1] https://github.com/tabulapdf/tabula-java/issues
> —
> Manuel Aristarán <manuel@jazzido.com>
> http://jazzido.com

Peter Murray-Rust
Reader Emeritus in Molecular Informatics
Unilever Centre, Dept. Of Chemistry
University of Cambridge
