pdfbox-users mailing list archives

From Thach Tran <tranngoctha...@gmail.com>
Subject Re: Extracting paper/book title from a PDF
Date Mon, 02 Feb 2009 19:08:33 GMT
Daniele Development-ML wrote:
> Hello everybody,
> I'm using PDFBox to try to extract some specific text from a PDF file. In
> particular, I'm trying to detect the book title, author, and the
> bibliographic entries (the references) - the PDF file is printed through the
> pdftex command.
> Extracting the raw text doesn't help too much as no data is carried with
> that. I was therefore trying to browser the document structure and access
> the COS objects and get the text value through them. This may just and only
> work for the title, and the authors - which both might be written in a
> different paragraph.
> However, I'm getting a bit confused on the real feasibility of this approach
> and on the use of the documentTreeStructure and the COSDictionary.
> Has anybody ever faced/solved this problem?
> Any comments or suggestions, or pointers to examples? The examples in the
> distro seem not to cover this aspect fully, or perhaps I am wrong.
> Many thanks,
> Dan
Hi Dan!
I don't think you can extract the title, author, or any other "specific" 
text from what the PDF actually displays, and the format is not meant to 
work that way. The content of a PDF page simply does not record whether 
a piece of text is a title, an author, etc. If I understand correctly, 
you want to take the text of the first paragraph as the title and the 
text of the next paragraph as the author. That is not very feasible 
either, because PDF has no notion of a paragraph.
For instance, for a title "My Title", the content of the page may just 
say something like: display "My Title" at point x,y.
Moreover, for PDFs generated by pdftex the situation is even worse. To 
achieve high-quality typesetting, TeX/LaTeX positions text in a very 
complex way. For example, you could find that your title "My Title" is 
specified as follows in the PDF's content:
display "M" at position x1, y1
display "y" at position x2, y2
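
To give a feel for what a text extractor then has to do, here is a rough 
sketch that rebuilds a string from individually placed glyphs. It is not 
PDFBox code: the Glyph class, the join method and the gap threshold are 
all made up for this illustration, and it assumes every glyph sits on the 
same line.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class GlyphJoin {
    /** Hypothetical model of one "display c at x, y" instruction; not a PDFBox type. */
    static final class Glyph {
        final char c;
        final double x;
        final double y;

        Glyph(char c, double x, double y) {
            this.c = c;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Rebuilds a string from glyphs assumed to be on one line: sorts them
     * left to right and inserts a space wherever the distance between the
     * start positions of two neighbours exceeds spaceGap.
     */
    static String join(List<Glyph> glyphs, double spaceGap) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble(g -> g.x));
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : sorted) {
            if (prev != null && g.x - prev.x > spaceGap) {
                out.append(' ');
            }
            out.append(g.c);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = new ArrayList<>();
        // The glyphs need not arrive in reading order.
        glyphs.add(new Glyph('T', 30.0, 700.0));
        glyphs.add(new Glyph('M', 10.0, 700.0));
        glyphs.add(new Glyph('y', 18.0, 700.0));
        glyphs.add(new Glyph('i', 36.0, 700.0));
        glyphs.add(new Glyph('t', 39.0, 700.0));
        glyphs.add(new Glyph('l', 43.0, 700.0));
        glyphs.add(new Glyph('e', 46.0, 700.0));
        System.out.println(join(glyphs, 10.0)); // prints "My Title"
    }
}
```

Even when this works, the result is only a string of characters. Nothing 
in it says that the string is a title.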

Your best hope is to get hold of the PDDocumentInformation object (by 
calling getDocumentInformation() on a PDDocument), which represents the 
Info dictionary referenced from the trailer of the PDF file. It can 
contain the title and author of the document, and it is also the 
appropriate place to store such information in a PDF.
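
A minimal sketch of that call, assuming a PDFBox 2.x jar on the classpath 
(other releases load documents differently, e.g. Loader.loadPDF(File) in 
3.x, so adjust the loading line to your version). The InfoDump class and 
the orNotSet helper are just names for this example.

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;

public class InfoDump {
    /** Returns a placeholder when an Info entry is absent or blank. */
    static String orNotSet(String value) {
        return (value == null || value.trim().isEmpty()) ? "(not set)" : value;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: PDFBox 2.x, where PDDocument.load(File) opens the file.
        try (PDDocument doc = PDDocument.load(new File(args[0]))) {
            PDDocumentInformation info = doc.getDocumentInformation();
            // getTitle() and getAuthor() return null when the entry is missing.
            System.out.println("Title:  " + orNotSet(info.getTitle()));
            System.out.println("Author: " + orNotSet(info.getAuthor()));
        }
    }
}
```

Run it with the path of the PDF as the only argument. If both lines print 
"(not set)", the file carries no title or author in its Info dictionary.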
However, I doubt that this information is included in the PDF you are 
working with. It is metadata that is not displayed when viewing the 
file, so people often don't bother to set it when producing the file.
In the case of pdftex, one typically has to use the hyperref package and 
explicitly specify the title and author with \hypersetup in order to 
produce a PDF with that metadata.
Sorry for the lengthy explanation; I was just trying to make it clear :-)

