pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Detect headers of PDF
Date Thu, 28 Jul 2016 06:09:50 GMT
On 28.07.2016 at 03:27, Qingchao Kong wrote:
> Hi, I want to detect the headers of PDF docs.
> In my PDF files, I notice that, usually the headers of PDF and the
> main text body are separated  by a horizontal line. Is it possible to
> detect this "line" using Java code?

Yes, but this is tricky: PDF has no concept of a <HEADER>. Have a look here:

This does something else, but the principle is the same: Analyze the 
content stream.

To understand what the PDF operators do, get the PDF 32000 specification
and go to the section "Operator Summary" (Annex A).

If you're lucky, the line is really a line, i.e. the path operators m
(moveto) and l (lineto). If you're not lucky, it is a small image or a
thin rectangle (operator re).
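
To make the "lucky" case concrete, here is a rough sketch in plain Java. It
assumes you already have the page's content stream as decoded text, and it
only looks at the m, l and re operators. The class HorizontalLineFinder and
its find method are made-up names for illustration, not part of PDFBox. The
sketch ignores the transformation matrix (cm), string operands and inline
images, so treat the coordinates as a first guess only. In PDFBox 2.0 the
more robust route would probably be to subclass PDFGraphicsStreamEngine and
collect the same information in moveTo, lineTo and appendRectangle.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of PDFBox. */
final class HorizontalLineFinder {

    /** A horizontal segment in the content stream's own (untransformed) coordinates. */
    static final class Segment {
        final double x1, x2, y;
        Segment(double x1, double x2, double y) {
            this.x1 = x1;
            this.x2 = x2;
            this.y = y;
        }
    }

    /**
     * Scans a decoded content stream for horizontal "m ... l" segments and for
     * "re" rectangles whose height is at most the tolerance.
     * Assumption: tokens are whitespace-separated; the cm operator is ignored.
     */
    static List<Segment> find(String contentStream, double tolerance) {
        List<Segment> found = new ArrayList<>();
        List<Double> operands = new ArrayList<>();
        double curX = 0, curY = 0;
        boolean hasPoint = false;

        for (String token : contentStream.trim().split("\\s+")) {
            try {
                // Numbers are operands; collect them until an operator shows up.
                operands.add(Double.parseDouble(token));
                continue;
            } catch (NumberFormatException e) {
                // Not a number, so treat the token as an operator below.
            }
            int n = operands.size();
            if (token.equals("m") && n >= 2) {
                // moveto: start a new subpath at (x, y).
                curX = operands.get(n - 2);
                curY = operands.get(n - 1);
                hasPoint = true;
            } else if (token.equals("l") && n >= 2) {
                // lineto: horizontal if y stays (almost) the same and x changes.
                double x = operands.get(n - 2);
                double y = operands.get(n - 1);
                if (hasPoint && Math.abs(y - curY) <= tolerance && x != curX) {
                    found.add(new Segment(Math.min(curX, x), Math.max(curX, x), curY));
                }
                curX = x;
                curY = y;
                hasPoint = true;
            } else if (token.equals("re") && n >= 4) {
                // rectangle "x y w h re": a very flat one is likely drawn as a line.
                double x = operands.get(n - 4);
                double y = operands.get(n - 3);
                double w = operands.get(n - 2);
                double h = operands.get(n - 1);
                if (Math.abs(h) <= tolerance && w != 0) {
                    found.add(new Segment(Math.min(x, x + w), Math.max(x, x + w), y));
                }
            }
            operands.clear(); // operands belong to the operator just seen
        }
        return found;
    }
}
```

For example, HorizontalLineFinder.find("72 720 m 540 720 l S", 1.0) would
report one segment from x=72 to x=540 at y=720. A separator image would not
be found this way at all.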


> If this is possible, so I can get the rectangle of the main text area
> and remove the headers automatically using Java code.
