Subject: svn commit: r945783 [11/12] - in /websites/staging/pdfbox/trunk/content: ./ docs/2.0.0-SNAPSHOT/javadocs/org/apache/pdfbox/cos/ docs/2.0.0-SNAPSHOT/javadocs/org/apache/pdfbox/cos/class-use/ docs/2.0.0-SNAPSHOT/javadocs/org/apache/pdfbox/multipdf/ docs/...
Date: Tue, 31 Mar 2015 09:35:53 -0000
To: commits@pdfbox.apache.org
From: buildbot@apache.org

Added: websites/staging/pdfbox/trunk/content/docs/2.0.0-SNAPSHOT/javadocs/org/apache/pdfbox/text/PDFTextStripper.html
==============================================================================
--- websites/staging/pdfbox/trunk/content/docs/2.0.0-SNAPSHOT/javadocs/org/apache/pdfbox/text/PDFTextStripper.html (added)
+++ websites/staging/pdfbox/trunk/content/docs/2.0.0-SNAPSHOT/javadocs/org/apache/pdfbox/text/PDFTextStripper.html Tue Mar 31 09:35:52 2015
@@ -0,0 +1,1760 @@
+PDFTextStripper (Apache PDFBox 2.0.0-SNAPSHOT API)
+
org.apache.pdfbox.text
+

Class PDFTextStripper

+
+
+ +
+
    +
  • +
    +
    Direct Known Subclasses:
    +
    PDFTextStripperByArea
    +
    +
    +
    +
    public class PDFTextStripper
    +extends PDFStreamEngine
    +
    This class takes a PDF document and strips out all of the text, ignoring the formatting. Please note: it is up to clients of this class to verify that a specific user has the correct permissions to extract text from the PDF document. The basic flow of this process is that we get a document and use a series of processXXX() functions that work on smaller and smaller chunks of the page. Eventually, we fully process each page and then print it.
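The flow described above can be sketched with a small stand-in class. This is not the PDFBox implementation: `FlowSketch` and its `events` list are hypothetical, pages are reduced to plain page numbers, and the exact order in which the hooks fire is an assumption based on the method names and descriptions in this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a text stripper; it is NOT the PDFBox class.
// It only records the order in which the hook methods are assumed to fire.
class FlowSketch {
    final List<String> events = new ArrayList<>();

    void startDocument() { events.add("startDocument"); }
    void endDocument() { events.add("endDocument"); }
    void startPage(int pageNo) { events.add("startPage " + pageNo); }
    void endPage(int pageNo) { events.add("endPage " + pageNo); }
    void writePage(int pageNo) { events.add("writePage " + pageNo); }

    // Document level: hand each page to the page-level method.
    void writeText(int pageCount) {
        startDocument();
        for (int pageNo = 1; pageNo <= pageCount; pageNo++) {
            processPage(pageNo);
        }
        endDocument();
    }

    // Page level: the page is fully processed, then printed.
    void processPage(int pageNo) {
        startPage(pageNo);
        writePage(pageNo);
        endPage(pageNo);
    }
}
```

A subclass that wants to emit page markers would override only the start/end hooks and leave the traversal untouched.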
    +
    Author:
    +
    Ben Litchfield
    +
  • +
+
+
+ +
+
+
    +
  • + +
      +
    • + + +

      Field Detail

      + + + +
        +
      • +

        LINE_SEPARATOR

        +
        protected final String LINE_SEPARATOR
        +
        The platform's line separator.
        +
      • +
      + + + +
        +
      • +

        charactersByArticle

        +
        protected Vector<List<TextPosition>> charactersByArticle
        +
        charactersByArticle is used to extract text by article divisions. For example, in a PDF that has two columns like a newspaper, we want to extract the first column and then the second column. In this example the PDF would have 2 beads (or articles), one for each column. The size of charactersByArticle would be 5, because not all text on the screen will fall into one of the articles. The five divisions are shown below:
          text before first article
          first article text
          text between first article and second article
          second article text
          text after second article
        Most PDFs won't have any beads, so charactersByArticle will contain a single entry.
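The arithmetic behind the two-column example can be sketched as follows. The class and method names are hypothetical, and the general rule (n beads give 2n + 1 divisions) is an assumption extrapolated from the 2-bead case described above, not something this page states.

```java
// Hypothetical helper, not part of PDFBox.
// Assumption: generalizing the two-column example, n beads give 2n + 1
// divisions when bead separation is on, and exactly one division otherwise.
class ArticleDivisions {
    static int divisionCount(int beadCount, boolean separateByBeads) {
        if (!separateByBeads || beadCount == 0) {
            return 1; // a single entry holds all text on the page
        }
        return 2 * beadCount + 1;
    }

    // Index of the division holding the text of bead k (0-based),
    // following the order of the five divisions listed above.
    static int indexOfBead(int k) {
        return 2 * k + 1;
    }
}
```

With two beads, index 0 is the text before the first article, indices 1 and 3 are the articles themselves, and indices 2 and 4 collect the text between and after them.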
        +
      • +
      + + + + + + + +
        +
      • +

        output

        +
        protected Writer output
        +
      • +
      +
    • +
    + +
      +
    • + + +

      Constructor Detail

      + + + +
        +
      • +

        PDFTextStripper

        +
        public PDFTextStripper()
        +                throws IOException
        +
        Instantiate a new PDFTextStripper object.
        +
        Throws:
        +
        IOException - If there is an error loading the properties.
        +
      • +
      +
    • +
    + +
      +
    • + + +

      Method Detail

      + + + +
        +
      • +

        getText

        +
        public String getText(PDDocument doc)
        +               throws IOException
        +
        This will return the text of a document. See writeText.
        + NOTE: The document must not be encrypted when coming into this method.
        +
        Parameters:
        doc - The document to get the text from.
        +
        Returns:
        The text of the PDF document.
        +
        Throws:
        +
        IOException - if the doc state is invalid or it is encrypted.
        +
      • +
      + + + +
        +
      • +

        writeText

        +
        public void writeText(PDDocument doc,
        +             Writer outputStream)
        +               throws IOException
        +
        This will take a PDDocument and write the text of that document to the print writer.
        +
        Parameters:
        doc - The document to get the data from.
        outputStream - The location to put the text.
        +
        Throws:
        +
        IOException - If the doc is in an invalid state.
        +
      • +
      + + + +
        +
      • +

        processPages

        +
        protected void processPages(PDPageTree pages)
        +                     throws IOException
        +
        This will process all of the pages and the text that is in them.
        +
        Parameters:
        pages - The pages object in the document.
        +
        Throws:
        +
        IOException - If there is an error parsing the text.
        +
      • +
      + + + +
        +
      • +

        startDocument

        +
        protected void startDocument(PDDocument document)
        +                      throws IOException
        +
        This method is available for subclasses of this class. It is called before processing of the document starts.
        +
        Parameters:
        document - The PDF document that is being processed.
        +
        Throws:
        +
        IOException - If an IO error occurs.
        +
      • +
      + + + +
        +
      • +

        endDocument

        +
        protected void endDocument(PDDocument document)
        +                    throws IOException
        +
        This method is available for subclasses of this class. It is called after processing of the document finishes.
        +
        Parameters:
        document - The PDF document that is being processed.
        +
        Throws:
        +
        IOException - If an IO error occurs.
        +
      • +
      + + + +
        +
      • +

        processPage

        +
        public void processPage(PDPage page)
        +                 throws IOException
        +
        This will process the contents of a page.
        +
        Parameters:
        page - The page to process.
        +
        Throws:
        +
        IOException - If there is an error processing the page.
        +
      • +
      + + + +
        +
      • +

        startArticle

        +
        protected void startArticle()
        +                     throws IOException
        +
        Start a new article, which is typically defined as a column on a single page (also referred to as a bead). This assumes that the primary direction of text is left to right. The default implementation does nothing. Subclasses may provide additional information.
        +
        Throws:
        +
        IOException - If there is any error writing to the stream.
        +
      • +
      + + + +
        +
      • +

        startArticle

        +
        protected void startArticle(boolean isLTR)
        +                     throws IOException
        +
        Start a new article, which is typically defined as a column on a single page (also referred to as a bead). The default implementation does nothing. Subclasses may provide additional information.
        +
        Parameters:
        isLTR - true if primary direction of text is left to right.
        +
        Throws:
        +
        IOException - If there is any error writing to the stream.
        +
      • +
      + + + +
        +
      • +

        endArticle

        +
        protected void endArticle()
        +                   throws IOException
        +
        End an article. The default implementation does nothing. Subclasses may provide additional information.
        +
        Throws:
        +
        IOException - If there is any error writing to the stream.
        +
      • +
      + + + +
        +
      • +

        startPage

        +
        protected void startPage(PDPage page)
        +                  throws IOException
        +
        Start a new page. The default implementation does nothing. Subclasses may provide additional information.
        +
        Parameters:
        page - The page we are about to process.
        +
        Throws:
        +
        IOException - If there is any error writing to the stream.
        +
      • +
      + + + +
        +
      • +

        endPage

        +
        protected void endPage(PDPage page)
        +                throws IOException
        +
        End a page. The default implementation does nothing. Subclasses may provide additional information.
        +
        Parameters:
        page - The page we have just processed.
        +
        Throws:
        +
        IOException - If there is any error writing to the stream.
        +
      • +
      + + + +
        +
      • +

        writePage

        +
        protected void writePage()
        +                  throws IOException
        +
        This will print the text of the processed page to "output". It will estimate, based on the coordinates of the text, where newlines and word spacings should be placed. The text will be sorted only if that feature was enabled.
        +
        Throws:
        +
        IOException - If there is an error writing the text.
        +
      • +
      + + + +
        +
      • +

        writeLineSeparator

        +
        protected void writeLineSeparator()
        +                           throws IOException
        +
        Write the line separator value to the output stream.
        +
        Throws:
        +
        IOException - If there is a problem writing out the line separator to the document.
        +
      • +
      + + + +
        +
      • +

        writeWordSeparator

        +
        protected void writeWordSeparator()
        +                           throws IOException
        +
        Write the word separator value to the output stream.
        +
        Throws:
        +
        IOException - If there is a problem writing out the word separator to the document.
        +
      • +
      + + + +
        +
      • +

        writeCharacters

        +
        protected void writeCharacters(TextPosition text)
        +                        throws IOException
        +
        Write the string in TextPosition to the output stream.
        +
        Parameters:
        text - The text to write to the stream.
        +
        Throws:
        +
        IOException - If there is an error when writing the text.
        +
      • +
      + + + +
        +
      • +

        writeString

        +
        protected void writeString(String text,
        +               List<TextPosition> textPositions)
        +                    throws IOException
        +
        Write a Java string to the output stream. The default implementation ignores the textPositions and just calls writeString(String).
        +
        Parameters:
        text - The text to write to the stream.
        textPositions - The TextPositions belonging to the text.
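The delegation between the two writeString overloads can be sketched with stand-in classes. Both classes below are hypothetical and use only the JDK: a `List<Float>` of x coordinates stands in for the TextPosition list, and a `StringWriter` stands in for the output.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.List;

// Hypothetical stand-in for the stripper, NOT the PDFBox class.
class WriteStringSketch {
    final Writer output = new StringWriter();

    // Default behavior as described above: ignore the positions and
    // delegate to the single-argument overload.
    void writeString(String text, List<Float> positions) throws IOException {
        writeString(text);
    }

    void writeString(String text) throws IOException {
        output.write(text);
    }
}

// A subclass that does use the positions: it tags each string with the
// x coordinate of its first character (illustrative formatting only).
class BracketingSketch extends WriteStringSketch {
    @Override
    void writeString(String text, List<Float> positions) throws IOException {
        super.writeString("[" + text + "@" + positions.get(0) + "]");
    }
}
```

Overriding the two-argument overload is the natural hook for subclasses that need coordinates, since the one-argument overload never sees them.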
        +
        Throws:
        +
        IOException - If there is an error when writing the text.
        +
      • +
      + + + +
        +
      • +

        writeString

        +
        protected void writeString(String text)
        +                    throws IOException
        +
        Write a Java string to the output stream.
        +
        Parameters:
        text - The text to write to the stream.
        +
        Throws:
        +
        IOException - If there is an error when writing the text.
        +
      • +
      + + + +
        +
      • +

        processTextPosition

        +
        protected void processTextPosition(TextPosition text)
        +
        This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
        +
        Parameters:
        text - The text to process.
        +
      • +
      + + + +
        +
      • +

        getStartPage

        +
        public int getStartPage()
        +
        This is the page that the text extraction will start on. Pages are numbered from 1. For example, in a 5-page PDF document, if the start page is 1 then all pages will be extracted. If the start page is 4 then pages 4 and 5 will be extracted. The default value is 1.
        +
        Returns:
        Value of property startPage.
        +
      • +
      + + + +
        +
      • +

        setStartPage

        +
        public void setStartPage(int startPageValue)
        +
        This will set the first page to be extracted by this class.
        +
        Parameters:
        startPageValue - New value of property startPage.
        +
      • +
      + + + +
        +
      • +

        getEndPage

        +
        public int getEndPage()
        +
        This will get the last page that will be extracted. This is inclusive: for example, in a 5-page PDF an endPage value of 5 would extract the entire document, and an endPage value of 2 would extract pages 1 and 2. This defaults to Integer.MAX_VALUE so that all pages of the PDF will be extracted.
        +
        Returns:
        Value of property endPage.
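The startPage and endPage rules (1-based, both inclusive, defaults 1 and Integer.MAX_VALUE) can be sketched as a simple range check. `PageRangeSketch` and its methods are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
class PageRangeSketch {
    static final int DEFAULT_START_PAGE = 1;
    static final int DEFAULT_END_PAGE = Integer.MAX_VALUE;

    // Page numbers are 1-based and both bounds are inclusive.
    static boolean isExtracted(int pageNo, int startPage, int endPage) {
        return pageNo >= startPage && pageNo <= endPage;
    }

    // Lists the pages of a pageCount-page document that fall in the range.
    static List<Integer> extractedPages(int pageCount, int startPage, int endPage) {
        List<Integer> pages = new ArrayList<>();
        for (int pageNo = 1; pageNo <= pageCount; pageNo++) {
            if (isExtracted(pageNo, startPage, endPage)) {
                pages.add(pageNo);
            }
        }
        return pages;
    }
}
```

Because the default end page is Integer.MAX_VALUE, a caller who only sets the start page still gets everything up to the end of the document.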
        +
      • +
      + + + +
        +
      • +

        setEndPage

        +
        public void setEndPage(int endPageValue)
        +
        This will set the last page to be extracted by this class.
        +
        Parameters:
        endPageValue - New value of property endPage.
        +
      • +
      + + + +
        +
      • +

        setLineSeparator

        +
        public void setLineSeparator(String separator)
        +
        Set the desired line separator for output text. The line.separator system property is used if the line separator preference is not set explicitly using this method.
        +
        Parameters:
        separator - The desired line separator string.
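The fallback to the line.separator system property can be sketched as below. `LineSeparatorSketch` is a hypothetical class; how PDFBox stores the preference internally is not shown in this page, so the null-means-unset field is an assumption.

```java
// Hypothetical stand-in, not the PDFBox class.
class LineSeparatorSketch {
    // Assumption: null means "not set explicitly".
    private String lineSeparator;

    void setLineSeparator(String separator) {
        lineSeparator = separator;
    }

    // Falls back to the platform's line separator when nothing was set.
    String getLineSeparator() {
        return lineSeparator != null
                ? lineSeparator
                : System.getProperty("line.separator");
    }
}
```

Setting the separator explicitly is useful when the output must be identical across platforms, for example when comparing extracted text in tests.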
        +
      • +
      + + + +
        +
      • +

        getLineSeparator

        +
        public String getLineSeparator()
        +
        This will get the line separator.
        +
        Returns:
        The desired line separator string.
        +
      • +
      + + + +
        +
      • +

        getWordSeparator

        +
        public String getWordSeparator()
        +
        This will get the word separator.
        +
        Returns:
        The desired word separator string.
        +
      • +
      + + + +
        +
      • +

        setWordSeparator

        +
        public void setWordSeparator(String separator)
        +
        Set the desired word separator for output text. The PDFBox text extraction algorithm will output a space character if there is enough space between two words. By default a space character is used. If you need an accurate count of the characters that are found in a PDF document, then you might want to set the word separator to the empty string.
        +
        Parameters:
        separator - The desired word separator string.
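The effect of the word separator on a character count can be sketched with plain strings. `WordSeparatorSketch` is a hypothetical helper; it assumes the words have already been identified and only shows how the separator inflates the length of the output.

```java
import java.util.List;

// Hypothetical helper, not part of PDFBox.
class WordSeparatorSketch {
    // Joins already-detected words with the chosen word separator.
    static String join(List<String> words, String wordSeparator) {
        return String.join(wordSeparator, words);
    }
}
```

Joining "Hello" and "world" with a space yields 11 characters, while the empty separator yields 10, which matches the number of characters actually found in the words.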
        +
      • +
      + + + +
        +
      • +

        getSuppressDuplicateOverlappingText

        +
        public boolean getSuppressDuplicateOverlappingText()
        +
        Returns:
        Returns the suppressDuplicateOverlappingText.
        +
      • +
      + + + +
        +
      • +

        getCurrentPageNo

        +
        protected int getCurrentPageNo()
        +
        Get the current page number that is being processed.
        +
        Returns:
        A 1 based number representing the current page.
        +
      • +
      + + + +
        +
      • +

        getOutput

        +
        protected Writer getOutput()
        +
        The output stream that is being written to.
        +
        Returns:
        The stream that output is being written to.
        +
      • +
      + + + +
        +
      • +

        getCharactersByArticle

        +
        protected List<List<TextPosition>> getCharactersByArticle()
        +
        Character strings are grouped by articles. It is quite common that there will only be a single article. This returns a List that contains List objects; the inner lists contain TextPosition objects.
        +
        Returns:
        A double List of TextPositions for all text strings on the page.
        +
      • +
      + + + +
        +
      • +

        setSuppressDuplicateOverlappingText

        +
        public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue)
        +
        By default the text stripper will attempt to remove text that overlaps. Word paints the same character several times in order to make it look bold. Setting this to false extracts all text, which means that certain sections will be duplicated, but performance will be better.
        +
        Parameters:
        suppressDuplicateOverlappingTextValue - The suppressDuplicateOverlappingText to set.
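A simplified sketch of duplicate suppression follows. It is not the PDFBox algorithm: `OverlapSketch`, `Glyph`, and the fixed `tolerance` parameter are hypothetical, and the rule used here (same text within `tolerance` units on both axes counts as a duplicate) is an assumption chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the PDFBox implementation.
class OverlapSketch {
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Keeps the first occurrence of each glyph and drops later glyphs with
    // the same text that land within `tolerance` of a kept one on both axes.
    static List<Glyph> suppressDuplicates(List<Glyph> glyphs, float tolerance) {
        List<Glyph> kept = new ArrayList<>();
        for (Glyph candidate : glyphs) {
            boolean duplicate = false;
            for (Glyph existing : kept) {
                if (existing.text.equals(candidate.text)
                        && Math.abs(existing.x - candidate.x) < tolerance
                        && Math.abs(existing.y - candidate.y) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

Each new glyph is compared against the glyphs already kept, which hints at why turning the feature off can be faster.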
        +
      • +
      + + + +
        +
      • +

        getSeparateByBeads

        +
        public boolean getSeparateByBeads()
        +
        This will tell if the text stripper should separate by beads.
        +
        Returns:
        If the text will be grouped by beads.
        +
      • +
      + + + +
        +
      • +

        setShouldSeparateByBeads

        +
        public void setShouldSeparateByBeads(boolean aShouldSeparateByBeads)
        +
        Set if the text stripper should group the text output by a list of beads. The default value is true.
        +
        Parameters:
        aShouldSeparateByBeads - The new grouping of beads.
        +
      • +
      + + + +
        +
      • +

        getEndBookmark

        +
        public PDOutlineItem getEndBookmark()
        +
        Get the bookmark where text extraction should end, inclusive. Default is null.
        +
        Returns:
        The ending bookmark.
        +
      • +
      + + + +
        +
      • +

        setEndBookmark

        +
        public void setEndBookmark(PDOutlineItem aEndBookmark)
        +
        Set the bookmark where the text extraction should stop.
        +
        Parameters:
        aEndBookmark - The ending bookmark.
        +
      • +
      + + + +
        +
      • +

        getStartBookmark

        +
        public PDOutlineItem getStartBookmark()
        +
        Get the bookmark where text extraction should start, inclusive. Default is null.
        +
        Returns:
        The starting bookmark.
        +
      • +
      + + + +
        +
      • +

        setStartBookmark

        +
        public void setStartBookmark(PDOutlineItem aStartBookmark)
        +
        Set the bookmark where text extraction should start, inclusive.
        +
        Parameters:
        aStartBookmark - The starting bookmark.
        +
      • +
      + + + +
        +
      • +

        getAddMoreFormatting

        +
        public boolean getAddMoreFormatting()
        +
        This will tell if the text stripper should add some more text formatting.
        +
        Returns:
        true if some more text formatting will be added
        +
      • +
      + + + +
        +
      • +

        setAddMoreFormatting

        +
        public void setAddMoreFormatting(boolean newAddMoreFormatting)
        +
        Some additional text formatting will be added if addMoreFormatting is set to true. The default is false.
        +
        Parameters:
        newAddMoreFormatting - Tell PDFBox to add some more text formatting
        +
      • +
      + + + +
        +
      • +

        getSortByPosition

        +
        public boolean getSortByPosition()
        +
        This will tell if the text stripper should sort the text tokens before writing to the stream.
        +
        Returns:
        true If the text tokens will be sorted before being written.
        +
      • +
      + + + +
        +
      • +

        setSortByPosition

        +
        public void setSortByPosition(boolean newSortByPosition)
        +
        The order of the text tokens in a PDF file may not be the same as the order in which they appear visually on the screen. For example, a PDF writer may write out all text by font, so all bold or larger text first, then make a second pass and write out the normal text.
        + The default is to not sort by position.
        +
        + A PDF writer could choose to write each character in a different order. By default PDFBox does not sort the text tokens before processing them, for performance reasons.
        +
        Parameters:
        newSortByPosition - Tell PDFBox to sort the text positions.
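The difference between file order and visual order can be sketched with a plain comparator. `SortSketch` and `Token` are hypothetical; the sketch assumes y grows downward from the top of the page and sorts top to bottom, then left to right, which is a simplification and not necessarily the comparison PDFBox uses.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not the PDFBox implementation.
class SortSketch {
    static final class Token {
        final String text;
        final float x;
        final float y;

        Token(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Assumption: y grows downward, so smaller y means higher on the page.
    // Returns the token texts in top-to-bottom, left-to-right order.
    static List<String> sortByPosition(List<Token> tokens) {
        List<Token> sorted = new ArrayList<>(tokens);
        sorted.sort(Comparator.comparingDouble((Token t) -> t.y)
                .thenComparingDouble(t -> t.x));
        List<String> texts = new ArrayList<>();
        for (Token t : sorted) {
            texts.add(t.text);
        }
        return texts;
    }
}
```

In the usage below, the writer emitted both bold headings first and the body text afterwards; sorting restores the reading order at the cost of an extra pass over the tokens.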
        +
      • +
      + + + +
        +
      • +

        getSpacingTolerance

        +
        public float getSpacingTolerance()
        +
        Get the current space width-based tolerance value that is being used to estimate where spaces in text should be added. Note that the default value for this has been determined from trial and error.
        +
        Returns:
        The current tolerance / scaling factor
        +
      • +
      + + + +
        +
      • +

        setSpacingTolerance

        +
        public void setSpacingTolerance(float spacingToleranceValue)
        +
        Set the space width-based tolerance value that is used to estimate where spaces in text should be added. Note that the default value for this has been determined from trial and error. Setting this value larger will reduce the number of spaces added.
        +
        Parameters:
        spacingToleranceValue - tolerance / scaling factor to use
        +
      • +
      + + + +
        +
      • +

        getAverageCharTolerance

        +
        public float getAverageCharTolerance()
        +
        Get the current character width-based tolerance value that is being used to estimate where spaces in text should be added. Note that the default value for this has been determined from trial and error.
        +
        Returns:
        The current tolerance / scaling factor
        +
      • +
      + + + +
        +
      • +

        setAverageCharTolerance

        +
        public void setAverageCharTolerance(float averageCharToleranceValue)
        +
        Set the character width-based tolerance value that is used to estimate where spaces in text should be added. Note that the default value for this has been determined from trial and error. Setting this value larger will reduce the number of spaces added.
        +
        Parameters:
        averageCharToleranceValue - average tolerance / scaling factor to use
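One way the two tolerances could interact is sketched below. This page does not give the formula, so everything here is an assumption: `SpacingSketch` is hypothetical, each tolerance scales its own width (space width or average character width), and the smaller of the two scaled gaps is taken as the threshold beyond which a word separator is written.

```java
// Hypothetical sketch; the real PDFBox heuristic may differ.
class SpacingSketch {
    // Returns true if the horizontal gap between the end of the previous
    // character and the start of the next one is wide enough to be treated
    // as a word break.
    static boolean needsWordSeparator(float previousEndX, float nextStartX,
                                      float spaceWidth, float spacingTolerance,
                                      float averageCharWidth, float averageCharTolerance) {
        float deltaSpace = spaceWidth * spacingTolerance;
        float deltaCharWidth = averageCharWidth * averageCharTolerance;
        // Assumption: the smaller scaled width decides where the next word
        // is expected to start.
        float expectedStartOfNextWordX = previousEndX + Math.min(deltaSpace, deltaCharWidth);
        return nextStartX > expectedStartOfNextWordX;
    }
}
```

Under this sketch, raising either tolerance can only push the expected start further right, so fewer gaps qualify as word breaks, which is consistent with the note that larger values reduce the number of spaces added.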
        +
      • +
      + + + +
        +
      • +

        getIndentThreshold

        +
        public float getIndentThreshold()
        +
        Returns the multiple of whitespace character widths for the current text by which the current line start can be indented from the previous line start; beyond this, the current line start is considered to be a paragraph start.
        +
        Returns:
        the number of whitespace character widths to use when detecting paragraph indents.
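The indent rule can be sketched as a single comparison. `IndentSketch` and its parameter names are hypothetical; the sketch assumes left-to-right text and treats the threshold as a strict upper bound on the allowed indent, which is one reading of the description above.

```java
// Hypothetical sketch, not the PDFBox implementation.
class IndentSketch {
    // A line is treated as a paragraph start when it is indented from the
    // previous line start by more than indentThreshold whitespace widths.
    static boolean isParagraphStart(float previousLineStartX, float currentLineStartX,
                                    float whitespaceWidth, float indentThreshold) {
        float indent = currentLineStartX - previousLineStartX;
        return indent > indentThreshold * whitespaceWidth;
    }
}
```

With a whitespace width of 4 units and a threshold of 2, an indent of 12 units marks a paragraph start, while an indent of 4 units does not.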
        +
      • +
      + + + +
        +
      • +

        setIndentThreshold

        +
        public void setIndentThreshold(float indentThresholdValue)
        +
        sets the multiple of whitespace character widths + for the current text which the current [... 387 lines stripped ...]