pdfbox-users mailing list archives

From Andreas Lehmkuehler <andr...@lehmi.de>
Subject Re: How to define regions in PDFTextStripperByArea?
Date Sun, 04 May 2014 11:15:15 GMT

On 02.05.2014 13:18, Qingchao Kong wrote:
> Paul,
> I think I understand the difference between
> "stripper.setSortByPosition(true)" and
> "stripper.setSortByPosition(false)". It is best seen when you try
> to extract a PDF that has multiple columns, e.g. two columns.
> With "stripper.setSortByPosition(false)", the extraction
> result usually follows the reading order, which is fine. But with
> "stripper.setSortByPosition(true)", PDFBox extracts text from
> top to bottom, ignoring the columns, which is not what I expect.
I'm afraid there is a misunderstanding. PDFBox can't extract text in a
context-sensitive way, e.g. by detecting columns, headers or footers.

Just for clarification:

sortByPosition = false

PDFBox extracts the text in the order in which it appears in the PDF. In most
cases the text is already sorted by default, but that need not be true for every
PDF. Updated PDFs in particular are often no longer sorted.

sortByPosition = true

PDFBox extracts the text and tries to sort it using the position of each
character. This works fine for simple texts. It gets more complicated and may
lead to a wrong result if one of the following is used:

- different text sizes in the same line
- different font sizes in the same line
- super/subscripts
- multicolumns
- ....
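
To illustrate the multicolumn case, here is a minimal sketch in plain Java. It
does not use PDFBox and is not PDFBox's actual sorting code. The class
SortByPositionSketch, the Fragment type and the coordinates are hypothetical
and only model the two modes described above: keeping the stored order versus
ordering purely by position (top to bottom, then left to right).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SortByPositionSketch {

    /** Hypothetical stand-in for a positioned piece of text; y grows downward. */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Models sortByPosition = false: keep the order in which the text is stored. */
    static String inStreamOrder(List<Fragment> fragments) {
        StringBuilder out = new StringBuilder();
        for (Fragment f : fragments) {
            out.append(f.text).append('\n');
        }
        return out.toString();
    }

    /** Models sortByPosition = true: order by y, then x; equal y means same line. */
    static String sortedByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        StringBuilder out = new StringBuilder();
        Fragment previous = null;
        for (Fragment f : sorted) {
            if (previous != null) {
                // Same baseline: continue the line. Otherwise start a new one.
                out.append(previous.y == f.y ? ' ' : '\n');
            }
            out.append(f.text);
            previous = f;
        }
        return out.append('\n').toString();
    }

    /** A two-column page whose text is stored column by column (assumed layout). */
    static List<Fragment> twoColumnPage() {
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment("Left line 1", 50, 100));
        page.add(new Fragment("Left line 2", 50, 120));
        page.add(new Fragment("Right line 1", 300, 100));
        page.add(new Fragment("Right line 2", 300, 120));
        return page;
    }

    public static void main(String[] args) {
        List<Fragment> page = twoColumnPage();
        System.out.print(inStreamOrder(page));
        System.out.println("---");
        System.out.print(sortedByPosition(page));
    }
}
```

In this model the stored order keeps each column together, while the
position-based order merges the left and right column into the same lines,
which matches the behaviour described in the quoted question.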

Andreas Lehmkühler
