pdfbox-users mailing list archives

From Steve Deal <dev...@gmail.com>
Subject Tika and PDFBox NonSequentialPDFParser class
Date Tue, 15 May 2012 18:14:11 GMT
Background
I have been able to successfully parse PDF forms using the 1.7.0
SNAPSHOT. Specifically, I was able to use the new
PDDocument.loadNonSeq() method (see
https://issues.apache.org/jira/browse/PDFBOX-1199 for details) to
extract the form field data. This method
    public static PDDocument loadNonSeq( File file, RandomAccess scratchFile ) throws IOException
    {
        NonSequentialPDFParser parser = new NonSequentialPDFParser( file, scratchFile );
        parser.parse();
        return parser.getPDDocument();
    }
uses the recently added NonSequentialPDFParser class in PDFBox where
the constructor for the NonSequentialPDFParser is defined as:
    public NonSequentialPDFParser( File pdfFile, RandomAccess raBuf ) throws FileNotFoundException, IOException
This all works fine if PDFBox is used standalone. However, I am using
Apache Tika, which calls PDFBox but does not call loadNonSeq(), so I
have been thinking about how to extend the code.

Problem
Tika defines the parse method of its Parser interface as follows:
    public void parse(InputStream stream, ContentHandler handler,
                      Metadata metadata, ParseContext context)
            throws IOException, SAXException, TikaException
so only the InputStream is available, whereas PDDocument.loadNonSeq()
(and, in turn, the NonSequentialPDFParser class) expects a File
object.

While Tika may someday support PDF forms, I do not have the luxury of
waiting. I need to solve the problem now.
I would greatly appreciate the advice of more knowledgeable people on
how I should proceed.

Questions
How should I proceed with extending the code so that Tika will work
with PDFBox 1.7.0?
It doesn't seem likely that Tika would change the interface to supply
the File to be parsed.
Do I need to extend PDFBox so that it puts the InputStream (provided
by Tika) into a temporary file for use by the NonSequentialPDFParser
class?
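
To make the temporary-file idea concrete, here is a rough sketch of
what I have in mind. It uses only the standard Java 7 library; the
class name StreamSpooler and the method spoolToTempFile are
hypothetical names made up for illustration and are not part of
PDFBox or Tika.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Hypothetical helper (not a PDFBox or Tika class): copies an
// InputStream into a temporary file so that File-based APIs can read it.
public class StreamSpooler {

    public static File spoolToTempFile(InputStream stream) throws IOException {
        // createTempFile already creates an empty file on disk
        File tmp = File.createTempFile("tika-pdf-", ".pdf");
        try {
            // REPLACE_EXISTING is needed because the temp file already exists
            Files.copy(stream, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            // do not leave a partial file behind if the copy fails
            tmp.delete();
            throw e;
        }
        return tmp;
    }
}
```

The returned File could then be passed to PDDocument.loadNonSeq()
together with a scratch buffer, and deleted by the caller once the
document has been closed. The sketch does not close the InputStream,
since in the Tika case the stream belongs to the caller.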

WWYD - What Would You Do?

Thanks!
