pdfbox-users mailing list archives

From Thomas Chojecki <i...@rayman2200.de>
Subject Re: PDFBox and PDF Portfolio Files
Date Mon, 23 Jan 2012 20:23:45 GMT
On 23.01.2012 21:03, P. Hill wrote:
> I am a developer of software which uses Lucene & Tika. We recently came
> across what to us is a new file format, a PDF Portfolio file.
> http://help.adobe.com/en_US/Acrobat/9.0/Standard/WSA2872EA8-9756-4a8c-9F20-8E93D59D91CE.html

A PDF Portfolio is nearly the same thing as a PDF with attachments. You 
can take a look at [1].

> I'm using Tika 1.0 and it doesn't understand the contents of the
> portfolio except in a most rudimentary sense. It finds a bit of summary
> text, but not the contents of the documents within the portfolio.
> When parsing a portfolio file in PDFBox, is it already supported or are
> there plans to support this format?

I think the embedded content needs to be extracted first. You may run 
into some problems extracting attachments of type PDF. All other types 
of content should work fine.
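
To illustrate the extract-first idea, here is a rough sketch in Java. It 
is not PDFBox code: NameTreeNode, collectAttachments and looksLikePdf are 
hypothetical stand-ins so that the example runs without the library. It 
assumes the usual PDF layout, where attachments sit in a name tree whose 
nodes can hold entries directly or point to child nodes, so the walk has 
to recurse. Attachments that are themselves PDFs are only flagged, since 
they would need to be parsed again as documents.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PortfolioSketch {

    /** Hypothetical stand-in for one node of the embedded-files name tree. */
    static final class NameTreeNode {
        // Entries stored directly on this node: file name -> raw bytes.
        final Map<String, byte[]> names = new LinkedHashMap<>();
        // Child nodes, which may hold further entries.
        final List<NameTreeNode> kids = new ArrayList<>();
    }

    /** Collects every attachment reachable from the given node, depth-first. */
    static Map<String, byte[]> collectAttachments(NameTreeNode node) {
        Map<String, byte[]> out = new LinkedHashMap<>(node.names);
        for (NameTreeNode kid : node.kids) {
            out.putAll(collectAttachments(kid));
        }
        return out;
    }

    /** Simple check: true if the data begins with the PDF header "%PDF-". */
    static boolean looksLikePdf(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data == null || data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Build a tiny made-up tree: the root has no entries, one kid has two.
        NameTreeNode root = new NameTreeNode();
        NameTreeNode leaf = new NameTreeNode();
        leaf.names.put("report.pdf", "%PDF-1.4".getBytes(StandardCharsets.US_ASCII));
        leaf.names.put("notes.txt", "hello".getBytes(StandardCharsets.US_ASCII));
        root.kids.add(leaf);

        for (Map.Entry<String, byte[]> e : collectAttachments(root).entrySet()) {
            String kind = looksLikePdf(e.getValue())
                    ? "nested PDF, parse again as a document"
                    : "hand to a parser for its own type";
            System.out.println(e.getKey() + ": " + kind);
        }
    }
}
```

With PDFBox the tree would come from the document catalog's names 
dictionary instead of being built by hand, and the extracted bytes could 
then be passed on to Tika one attachment at a time.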

> -Paul

Best regards

[1] http://pdfbox.apache.org/userguide/file_references.html
