pdfbox-users mailing list archives

From Steve Deal <dev...@gmail.com>
Subject Re: Tika and PDFBox NonSequentialPDFParser class
Date Fri, 18 May 2012 10:59:08 GMT
Followup on the change to the Tika 1.1 PDF Parser.
The org.apache.tika.parser.pdf.PDFParser class was modified to always
load the document using a temp file.
The changes are shown below.
--------------------------------------------------------------------------------------
            try {
               // New - Use a temp file so it can be parsed twice
                tstream = TikaInputStream.get(stream, tmp);
                tsFile = tstream.getFile();

                // PDFBox can process entirely in memory, or can use a temp file
                //  for unpacked / processed resources
                // Decide which to do based on if we're reading from a file or not already
                if (tstream != null && tstream.hasFile()) {
                   // File based, take that as a cue to use a temporary file
                   scratchFile = new RandomAccessFile(tmp.createTemporaryFile(), "rw");
                   pdfDocument = PDDocument.load(tsFile, scratchFile);
                } else {
                   // Go for the normal, stream based in-memory parsing
                   pdfDocument = PDDocument.load(tsFile);
                }
    ...snip code to cope with encrypted files...
                metadata.set(Metadata.CONTENT_TYPE, "application/pdf");
                extractMetadata(pdfDocument, metadata);

                // New - Now parse again but non-sequentially to retrieve any form field data
                pdfFormDoc = PDDocument.loadNonSeq(tsFile, scratchFile);
                extractFormFieldData(pdfFormDoc, metadata);

                PDF2XHTML.process(pdfDocument, handler, metadata,
                        extractAnnotationText, enableAutoSpace,
                        suppressDuplicateOverlappingText, sortByPosition);
-------------------------------------------------------------------------------------------------
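For anyone unfamiliar with why the temp file is needed: an input stream can only be consumed once, so the bytes have to be spooled to disk before a second parse is possible. Below is a minimal sketch of that idea using only JDK classes. It is not the Tika code (TikaInputStream handles this internally); the class name SpoolTwice, the helper names, and the byte-counting "parse" are placeholders I made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration only; not part of Tika or PDFBox.
class SpoolTwice {

    /** Copies the stream to a temp file so its bytes can be read more than once. */
    public static Path spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spool-", ".pdf");
        // createTempFile already created the file, so the copy must overwrite it.
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    /** Stand-in for one parsing pass: it only counts the bytes it reads. */
    public static long countBytes(Path file) throws IOException {
        long total = 0;
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        InputStream source = new ByteArrayInputStream(
                "%PDF-1.4 placeholder".getBytes(StandardCharsets.US_ASCII));
        Path tmp = spoolToTempFile(source);
        try {
            long firstPass = countBytes(tmp);   // stands in for the sequential parse
            long secondPass = countBytes(tmp);  // stands in for the non-sequential parse
            System.out.println(firstPass + " " + secondPass);
        } finally {
            Files.deleteIfExists(tmp);          // always clean up the temp file
        }
    }
}
```

Both passes read the whole file from disk, which is the cost I am worried about for large documents.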
Parsing the file twice doesn't seem to be the optimal solution and
may create problems with large documents.  It seems that the long
term solution requires additional development work from both the Tika
and PDFBox teams.

What would be the best way to raise this common issue with the
appropriate developers on both teams?

Best Regards,
Steve
