pdfbox-users mailing list archives

From Steve Deal <dev...@gmail.com>
Subject Re: Tika and PDFBox NonSequentialPDFParser class
Date Wed, 16 May 2012 16:30:54 GMT
Following Jukka's first suggestion to change the PDF parser, I modified
the Tika 1.1 class org.apache.tika.parser.pdf.PDFParser to load the
document as follows:
 Starting at line 100
      TemporaryResources tmp2 = new TemporaryResources();
       try {
            TikaInputStream tstream = TikaInputStream.get(stream, tmp2);
            File tsFile = tstream.getFile();
            RandomAccess scratchFile = new RandomAccessFile(tmp.createTemporaryFile(), "rw");
            pdfDocument = PDDocument.loadNonSeq(tsFile, scratchFile);
            // PDFBox can process entirely in memory, or can use a temp file
            //  for unpacked / processed resources
            // Decide which to do based on if we're reading from a file or not already
//            TikaInputStream tstream = TikaInputStream.cast(stream);
//            if (tstream != null && tstream.hasFile()) {
//               // File based, take that as a cue to use a temporary file
//               RandomAccess scratchFile = new RandomAccessFile(tmp.createTemporaryFile(), "rw");
//               pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), scratchFile, true);
//            } else {
//               // Go for the normal, stream based in-memory parsing
//               pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), true);
//            }
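
For anyone wondering why the change goes through a File rather than the raw stream: as far as I understand it, the non-sequential parser starts from the cross-reference data near the end of the PDF, so it needs random access to the bytes. The following is only a minimal, JDK-only sketch of that idea and is not Tika or PDFBox code. The class SpoolDemo and its methods spoolToTempFile() and readTail() are hypothetical names made up for illustration, and the "PDF" bytes in main() are fake.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration only: spool a stream to disk, then seek to its end.
public class SpoolDemo {

    /** Copies the whole stream into a new temporary file and returns that file. */
    public static File spoolToTempFile(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("spool-", ".pdf");
        // createTempFile already created an empty file, so replace it with the stream contents
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp.toFile();
    }

    /** Returns the last n bytes of the file (or the whole file if it is shorter). */
    public static String readTail(File file, int n) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            long len = raf.length();
            int count = (int) Math.min(n, len);
            // Jump straight to the tail without reading what comes before it
            raf.seek(len - count);
            byte[] buf = new byte[count];
            raf.readFully(buf);
            return new String(buf, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) throws IOException {
        // Fake content standing in for a PDF; not a valid document
        byte[] fake = "%PDF-1.4\n...body...\nstartxref\n9\n%%EOF"
                .getBytes(StandardCharsets.ISO_8859_1);
        File f = spoolToTempFile(new ByteArrayInputStream(fake));
        try {
            System.out.println(readTail(f, 5));
        } finally {
            f.delete();
        }
    }
}
```

A plain InputStream cannot seek backwards like this, which is presumably why the modified code asks TikaInputStream for a File before calling loadNonSeq().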

Tika builds and passes all the unit tests using loadNonSeq()    :-)
Now I will move on to my own testing.

Thanks again to Jukka for pointing me in the right direction!

Best Regards,
Steve Deal
"...I will choose a path that's clear, I will choose free will" - Rush
