tika-user mailing list archives

From "Ensor, Neal" <Ens...@osti.gov>
Subject PDF and MS Word Metadata question: page counts
Date Thu, 22 Jul 2010 15:22:30 GMT
Just a curiosity:  I'm currently using Tika 0.7 for some simple text extraction, and noticed
that I can't access page counts for either PDF or Word documents.

I know the information is available via the underlying library calls (e.g., PDFBox), and it appears
that it should be available via the extended information in the MS Office parser, but I don't see it
in the metadata of any documents I tried.  My question is: was there some reason why page
counts are omitted?  I hacked my local copy of PDFParser to provide one via the PDDocument.getNumberOfPages()
call, but was wondering if I missed something somewhere, or whether there is a reason not to
provide such information.  For the Word documents, since the count apparently should already be
provided but isn't, I guess I'm out of luck there, but for my purposes I'd like at least the parsed
PDF metadata to provide that information if possible.  Thanks!
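
For illustration, the kind of local change described above boils down to copying the document's
page count into the metadata as a string.  The following is only a minimal sketch, not the actual
patch: PagedDocument is a stand-in for the one PDDocument method involved, a plain Map stands in
for Tika's Metadata object so the example runs without either library, and the "pageCount" key is
a hypothetical name rather than an existing Tika constant.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PageCountSketch {

    /** Stand-in for PDDocument; only the page-count call is modelled. */
    public interface PagedDocument {
        int getNumberOfPages();
    }

    /** Hypothetical metadata key (an assumption, not a Tika constant). */
    public static final String PAGE_COUNT_KEY = "pageCount";

    /** Stores the document's page count in the metadata as a string value. */
    public static void addPageCount(PagedDocument document, Map<String, String> metadata) {
        metadata.put(PAGE_COUNT_KEY, Integer.toString(document.getNumberOfPages()));
    }

    public static void main(String[] args) {
        // The Map stands in for the Metadata instance a parser fills in.
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", "application/pdf");

        // A fake 12-page document; a real parser would pass the loaded PDDocument.
        addPageCount(() -> 12, metadata);

        System.out.println(metadata);
    }
}
```

In a real parser the same step would presumably sit wherever the other document properties are
copied into the metadata, while the loaded document is still open.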

Neal Ensor
ensorn@osti.gov