pdfbox-users mailing list archives

From: "Allison, Timothy B." <talli...@mitre.org>
Subject: per page processing?
Date: Wed, 15 Jul 2015 11:52:04 GMT

All,
  Raymond Wu recently opened TIKA-1679 and recommended that we switch to per-page processing
so that if there's an exception on one page, we'll still be able to extract contents from
other pages.

  The proposed fix is along these lines:

    int nop = document.getNumberOfPages();
    // Page numbers are 1-based, and the end page is inclusive.
    for (int i = 1; i <= nop; i++) {
        // Create a fresh PDF2XHTML for each page.
        PDF2XHTML pdf2XHTML = new PDF2XHTML(handler, metadata,
                extractAnnotationText, enableAutoSpace,
                suppressDuplicateOverlappingText, sortByPosition);
        try {
            // Restrict this pass to page i only.
            pdf2XHTML.setStartPage(i);
            pdf2XHTML.setEndPage(i);
            pdf2XHTML.writeText(document, dummyWriter);
        } catch (Exception e) {
            // TODO ...
        }
    }

  Does this seem reasonable?  Any gut reaction/estimates on the performance hit?  Perhaps
we should make this mode configurable?
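
  To make the intended behavior concrete, here is a minimal, library-free sketch of the same
pattern: each page is extracted inside its own try/catch, and failures are recorded rather than
aborting the whole document. The names PerPageExtraction, PageExtractor, Result and extractAll
are hypothetical and exist only for illustration; they are not part of Tika or PDFBox, and
the lambda in main merely stands in for the real per-page call.

```java
import java.util.ArrayList;
import java.util.List;

public class PerPageExtraction {

    /** Hypothetical stand-in for a per-page extractor (not a Tika or PDFBox type). */
    interface PageExtractor {
        String extract(int pageNumber) throws Exception;
    }

    /** Text gathered from the good pages, plus the 1-based numbers of pages that failed. */
    static final class Result {
        final StringBuilder text = new StringBuilder();
        final List<Integer> failedPages = new ArrayList<>();
    }

    /** Extracts pages 1..numberOfPages, isolating any exception to the page that threw it. */
    static Result extractAll(int numberOfPages, PageExtractor extractor) {
        Result result = new Result();
        for (int page = 1; page <= numberOfPages; page++) {
            try {
                result.text.append(extractor.extract(page)).append('\n');
            } catch (Exception e) {
                // Record the failure and continue with the next page.
                result.failedPages.add(page);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Simulate a three-page document whose second page cannot be parsed.
        Result r = extractAll(3, page -> {
            if (page == 2) {
                throw new IllegalStateException("corrupt page");
            }
            return "text of page " + page;
        });
        System.out.print(r.text);
        System.out.println("failed pages: " + r.failedPages);
    }
}
```

  Recording the failed page numbers, instead of swallowing the exception silently, is one
possible way to fill in the TODO above; whether to surface them to the caller is a separate
design question.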

Thank you.

Best,

Tim
