pdfbox-dev mailing list archives

From Andreas Lehmkühler (JIRA) <j...@apache.org>
Subject [jira] [Commented] (PDFBOX-1104) Improves parsing speed of a pdf by an average of 45% when extracting text from one random page in the document.
Date Sat, 20 Aug 2011 05:51:27 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-1104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088143#comment-13088143 ]

Andreas Lehmkühler commented on PDFBOX-1104:
--------------------------------------------

I haven't looked at the sources, but the description sounds like Adam's approach to implementing
a conforming parser (PDFBOX-1000).

> Improves parsing speed of a pdf by an average of 45% when extracting text from one random page in the document.
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1104
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1104
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing, Utilities
>    Affects Versions: 1.6.0
>            Reporter: Jeremy Villalobos
>            Priority: Minor
>             Fix For: 1.6.0
>
>         Attachments: OnePagePDFTextStripper.java, PagesNotExpectedHere.java, ParseTester.java,
>                      QuickParser.java, fast_parser.diff
>
>
> The proposed parser parses only the minimum required from the PDF file, according to the PDF
> specification. A random page can be parsed without having to parse the entire document first.
> Existing parsing code was reused so that existing bug fixes and compliance fixes carry over
> to this parser.
> The parser has been tested with the text extraction tool, but not with the viewer or other
> PDF tools. Some tools may need to be recoded to use the parser without hitting null pointer
> exceptions, since the COSDocument will contain null references for COSObjects that have not
> been parsed. For example, the current text extractor assumes the entire document is loaded.
> This code submission also includes a modified text extractor named OnePagePDFTextStripper.
> The class has a method that extracts the text from a PDPage supplied by the programmer.
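
The null pointer hazard described in the quoted text can be illustrated with a minimal,
self-contained sketch. It does not use PDFBox and is not the code from the attached patch:
LazyParseSketch, LazyObjectPool, peek and resolve are hypothetical names, and the "parser" is a
stand-in function. The idea is only that a pool filled on demand returns null for objects nobody
has asked for yet, so callers either resolve objects explicitly or guard against null.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntFunction;

// Illustration only: hypothetical names, not PDFBox API and not the attached patch.
public class LazyParseSketch {

    /** An object pool that parses entries on first use instead of up front. */
    static final class LazyObjectPool {
        private final Map<Integer, String> parsed = new HashMap<>();
        private final IntFunction<String> parser; // stand-in for "parse object N from the file"
        private int parseCount = 0;

        LazyObjectPool(IntFunction<String> parser) {
            this.parser = parser;
        }

        /** Returns the object only if it was already parsed; otherwise null. */
        String peek(int objectNumber) {
            return parsed.get(objectNumber);
        }

        /** Parses the object on first access, caches it, and returns it. */
        String resolve(int objectNumber) {
            String value = parsed.get(objectNumber);
            if (value == null) {
                value = parser.apply(objectNumber);
                parseCount++;
                parsed.put(objectNumber, value);
            }
            return value;
        }

        /** Number of objects actually parsed so far. */
        int parseCount() {
            return parseCount;
        }
    }

    public static void main(String[] args) {
        LazyObjectPool pool = new LazyObjectPool(n -> "object " + n);

        // Code that assumes a fully loaded document would see null here.
        System.out.println("before resolve: " + pool.peek(7));

        // Code written for the lazy pool asks for the object explicitly.
        System.out.println("after resolve: " + pool.resolve(7));

        // Only the requested object was parsed; everything else stays untouched.
        System.out.println("objects parsed: " + pool.parseCount());
    }
}
```

In this sketch a tool that calls only peek would need a null check, while a tool that calls
resolve pays the parsing cost for just the objects it touches.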

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
