pdfbox-users mailing list archives

From Steve Deal <dev...@gmail.com>
Subject Re: Tika and PDFBox NonSequentialPDFParser class
Date Wed, 16 May 2012 12:43:04 GMT
Thank you very much for the quick response and pointing me in the
right direction.

Since I'm new to both Tika and PDFBox, could you please clarify the
changes you suggested?
When you say "adjust the PDF Parser", are you referring to the
org.apache.tika.parser.pdf.PDFParser class in the Tika project (the
alternative being org.apache.pdfbox.pdfparser.PDFParser)? I don't
really understand how AutoDetectParser in Tika works, so it isn't clear
to me which of the two classes is being used. I would think that
Tika would use org.apache.tika.parser.pdf.PDFParser, but I'm not sure.

Furthermore, I was confused by your statement:
"Then, when calling the parser, simply pass it a TikaInputStream
instance created based on a local file you have:
InputStream stream = TikaInputStream.get(file); "

Following your first suggestion to change the PDF parser, it seems that
I should modify the Tika class org.apache.tika.parser.pdf.PDFParser
to load the document as follows:
        try {
            TikaInputStream tstream = TikaInputStream.cast(stream);
            if (tstream != null && tstream.hasFile()) {
                // File based, take that as a cue to use a temporary file
                RandomAccess scratchFile = new RandomAccessFile(tmp.createTemporaryFile(), "rw");
                // original: pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), scratchFile, true);
                pdfDocument = PDDocument.loadNonSeq(tstream.getFile(), scratchFile, true);
            } else {
                // Go for the normal, stream based in-memory parsing
                // original: pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), true);
                pdfDocument = PDDocument.loadNonSeq(tstream.getFile(), true);
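
One thing I am unsure about in the else branch: TikaInputStream.cast
returns null when the stream is not a TikaInputStream, so calling
tstream.getFile() there could fail. To check my understanding of the
branching I wrote a small stand-alone sketch that uses only the JDK.
FileAwareStream, cast and choosePath are names I made up for
illustration; they are not Tika or PDFBox APIs, and the sketch does no
real parsing.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.InputStream;

class LoaderChoice {
    /** Hypothetical stand-in for a stream that may know its backing file. */
    static class FileAwareStream extends ByteArrayInputStream {
        private final File file; // null when there is no backing file

        FileAwareStream(byte[] data, File file) {
            super(data);
            this.file = file;
        }

        boolean hasFile() {
            return file != null;
        }

        File getFile() {
            return file;
        }
    }

    /** Returns the stream as a FileAwareStream, or null for any other stream type. */
    static FileAwareStream cast(InputStream stream) {
        return (stream instanceof FileAwareStream) ? (FileAwareStream) stream : null;
    }

    /** Reports which loading path would be taken, without touching a real parser. */
    static String choosePath(InputStream stream) {
        FileAwareStream fstream = cast(stream);
        if (fstream != null && fstream.hasFile()) {
            // File based: the backing file can be handed to the loader directly
            return "file:" + fstream.getFile().getName();
        }
        // fstream may be null here, so only the plain stream is safe to use
        return "stream";
    }

    public static void main(String[] args) {
        System.out.println(choosePath(new FileAwareStream(new byte[0], new File("form.pdf"))));
        System.out.println(choosePath(new ByteArrayInputStream(new byte[0])));
    }
}
```

If that reading is right, the else branch in my change above would need
to work from the stream itself rather than from tstream.getFile().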

To expand on the context of the problem: we're using Alfresco, which
uses Tika and then PDFBox for PDF files. When a file is imported into
Alfresco, it uses Tika to extract metadata, but we also need it to
parse the form fields. Some PDF files will have forms and others will
not, so I expect that we'll always be parsing in non-sequential mode to
ensure that the form fields are parsed.

Thanks again for your advice.

