pdfbox-users mailing list archives

From Peter Murray-Rust <pm...@cam.ac.uk>
Subject Re: How to define regions in PDFTextStripperByArea?
Date Sun, 04 May 2014 11:42:00 GMT
+1 Andreas

There can never be an automatic way of reassembling structured text from
arbitrary PDFs. In our https://bitbucket.org/petermr/svg2xml-dev/ project
we are trying to do this for English language scientific documents, and we
use a number of heuristics based on whitespace, font sizes, weights,
English dictionaries, etc.

To show the impossibility, here is a chunk of "text":
PET    AGE
dog    3
cat    5
rat    1

Most people would interpret that as a table, but only because there are
implicit signals (uppercase labels, central white space) and linguistic
coherence (all PETs seem to be animals, all AGEs seem to be numbers). But
imagine if it was in Cyrillic, CJK, etc. It might even be read vertically.
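
As a rough illustration of the "central white space" signal, here is a
minimal Java sketch. It assumes the text has already been laid out in
monospaced lines, so that a character index stands in for a horizontal
position; a real PDF gives glyph coordinates instead. The class and method
names are hypothetical and belong neither to PDFBox nor to svg2xml.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox or svg2xml.
final class WhitespaceGutterSketch {

    // Returns the character columns that are blank (or beyond the end of
    // the line) in every line. A shared blank column hints at a table gutter.
    static List<Integer> gutterColumns(List<String> lines) {
        int width = 0;
        for (String line : lines) {
            width = Math.max(width, line.length());
        }
        List<Integer> gutters = new ArrayList<>();
        for (int col = 0; col < width; col++) {
            boolean blankInEveryLine = true;
            for (String line : lines) {
                if (col < line.length() && line.charAt(col) != ' ') {
                    blankInEveryLine = false;
                    break;
                }
            }
            if (blankInEveryLine) {
                gutters.add(col);
            }
        }
        return gutters;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("dog 3", "cat 5", "rat 1");
        System.out.println(gutterColumns(rows));  // prints [3]

        List<String> prose = Arrays.asList("the cat", "a dog sat");
        System.out.println(gutterColumns(prose)); // prints []
    }
}
```

The signal is weak on its own: short lines of ordinary prose can share a
blank column by accident, which is why it has to be combined with other
heuristics.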

Other common problems include:

* rotated text (e.g. along the sides of the page)
* floating boxes (e.g. a box surrounded by whitespace in the middle of
  running text)
* S  P  A  C  E  S    F  O  R    E  F  F  E  C  T
* hyphens at line end (do you remove them? not in chemistry!)
* indentation or outdentation
* numbering (e.g. 1.2.3 at para start)
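
To give a flavour of what such a heuristic looks like, here is a minimal
Java sketch for just one item above, the letter-spaced text. It assumes the
extractor has emitted single characters separated by runs of plain spaces,
with wider runs between words. The class name and the rule are illustrative
assumptions, not code from PDFBox or svg2xml.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox or svg2xml.
final class SpacedTextSketch {

    // A line made only of single characters separated by runs of spaces.
    private static final Pattern LETTER_SPACED = Pattern.compile("\\S(?: +\\S){2,}");
    private static final Pattern GAP = Pattern.compile(" +");

    // Collapses letter-spaced text: the narrowest gap is treated as letter
    // spacing and dropped; any wider gap is treated as a word break.
    static String collapse(String line) {
        String trimmed = line.trim();
        if (!LETTER_SPACED.matcher(trimmed).matches()) {
            return line; // not letter-spaced, leave untouched
        }
        int narrowest = Integer.MAX_VALUE;
        Matcher gaps = GAP.matcher(trimmed);
        while (gaps.find()) {
            narrowest = Math.min(narrowest, gaps.end() - gaps.start());
        }
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < trimmed.length()) {
            if (trimmed.charAt(i) != ' ') {
                out.append(trimmed.charAt(i));
                i++;
                continue;
            }
            int j = i;
            while (j < trimmed.length() && trimmed.charAt(j) == ' ') {
                j++;
            }
            if (j - i > narrowest) {
                out.append(' '); // wider than letter spacing: word break
            }
            i = j;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "S  P  A  C  E  S    F  O  R    E  F  F  E  C  T";
        System.out.println(collapse(line));    // prints SPACES FOR EFFECT
        System.out.println(collapse("dog 3")); // prints dog 3
    }
}
```

Note how easily it misfires: a genuine sequence of one-letter tokens such as
"a b c" would be glued into "abc", so a dictionary check or similar would be
needed on top.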

On Sun, May 4, 2014 at 12:15 PM, Andreas Lehmkuehler <andreas@lehmi.de> wrote:

> Hi,
> On 02.05.2014 13:18, Qingchao Kong wrote:
>> Paul,
>> I think I am aware of the difference between
>> "stripper.setSortByPosition(true)" and
>> "stripper.setSortByPosition(false)". It is best explained when you try
>> to extract a PDF that has multiple columns, e.g. two columns.
>> With "stripper.setSortByPosition(false)", the extraction result
>> usually follows the reading order, which is fine. But with
>> "stripper.setSortByPosition(true)", PDFBox extracts text from
>> top to bottom, ignoring the columns, which is not what I expected.
> I'm afraid there is a misunderstanding. PDFBox can't extract text in a
> context-sensitive way, e.g. by detecting columns, headers or footers.
> Just for clarification:
> sortByPosition = false
> PDFBox extracts the text in the order in which it appears in the PDF. In
> most cases the text will be sorted by default, but that need not be true
> for every PDF. Updated PDFs in particular are no longer sorted.
> sortByPosition = true
> PDFBox extracts the text and tries to sort it using the position of each
> character. This works fine for simple texts. It gets more complicated and
> may lead to a false result if one of the following is used:
> - different text sizes in the same line
> - different font sizes in the same line
> - super/subscripts
> - multicolumns
> - ....
> BR
> Andreas Lehmkühler
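
To make the two-column case above concrete, here is a toy model in Java. It
is not PDFBox's actual sorting code: it assumes each piece of text is a
chunk with an (x, y) position, with y growing down the page, and that
"sort by position" means ordering by y and then by x. The class, the
coordinates and the chunk labels are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy model only; PDFBox's real comparator is more involved.
final class SortByPositionSketch {

    static final class Chunk {
        final double x;
        final double y; // assumption: y grows down the page
        final String text;

        Chunk(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // A two-column page whose content stream lists the left column first.
    static List<Chunk> twoColumnPage() {
        return Arrays.asList(
                new Chunk(50, 100, "L1"),
                new Chunk(50, 120, "L2"),
                new Chunk(300, 100, "R1"),
                new Chunk(300, 120, "R2"));
    }

    // Models sortByPosition = false: keep the order found in the file.
    static List<String> streamOrder(List<Chunk> chunks) {
        List<String> out = new ArrayList<>();
        for (Chunk c : chunks) {
            out.add(c.text);
        }
        return out;
    }

    // Models sortByPosition = true: top to bottom, then left to right.
    static List<String> positionOrder(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                .thenComparingDouble(c -> c.x));
        return streamOrder(sorted);
    }

    public static void main(String[] args) {
        List<Chunk> page = twoColumnPage();
        System.out.println(streamOrder(page));   // prints [L1, L2, R1, R2]
        System.out.println(positionOrder(page)); // prints [L1, R1, L2, R2]
    }
}
```

In this model the position sort interleaves the two columns line by line,
which matches the behaviour Qingchao describes. It only looks "right" in
stream order because this particular file happens to store the columns in
reading order.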

Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
