pdfbox-users mailing list archives

From Daniele Development-ML <daniele....@googlemail.com>
Subject Extracting paper/book title from a PDF
Date Mon, 02 Feb 2009 17:56:36 GMT
Hello everybody,
I'm using PDFBox to try to extract some specific text from a PDF file. In
particular, I'm trying to detect the book title, the author, and the
bibliographic entries (the references). The PDF file is generated with the
pdftex command.

Extracting the raw text doesn't help much, as it carries no structural
information. I was therefore trying to browse the document structure, access
the COS objects, and get the text value through them. This may work only
for the title and the authors, both of which might be set in separate
paragraphs.

However, I'm unsure whether this approach is really feasible, and I'm
confused about how to use the documentTreeStructure and the COSDictionary.

Has anybody ever faced or solved this problem?
Any comments, suggestions, or pointers to examples? The examples in the
distribution don't seem to cover this aspect fully, but perhaps I am wrong.

Many thanks,

