pdfbox-commits mailing list archives

From msahy...@apache.org
Subject svn commit: r1879397 - /pdfbox/branches/issue45/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java
Date Wed, 01 Jul 2020 10:17:59 GMT
Author: msahyoun
Date: Wed Jul  1 10:17:59 2020
New Revision: 1879397

URL: http://svn.apache.org/viewvc?rev=1879397&view=rev
PDFBOX-4904: add hint to javadoc to setSortByPosition as suggested by Ronald Bergmann


Modified: pdfbox/branches/issue45/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java
URL: http://svn.apache.org/viewvc/pdfbox/branches/issue45/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java?rev=1879397&r1=1879396&r2=1879397&view=diff
--- pdfbox/branches/issue45/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java
+++ pdfbox/branches/issue45/pdfbox/src/main/java/org/apache/pdfbox/text/PDFTextStripper.java Wed Jul  1 10:17:59 2020
@@ -217,6 +217,11 @@ public class PDFTextStripper extends Leg
      * This will return the text of a document. See writeText. <br>
      * NOTE: The document must not be encrypted when coming into this method.
+     * 
+     * <p>IMPORTANT: By default, text extraction is done in the same sequence as the text in the PDF page content stream.
+     * PDF is a graphic format, not a text format, and unlike HTML, it has no requirements that text on one page
+     * be rendered in a certain order. The order is the one that was determined by the software that created the
+     * PDF. To get text sorted from left to right and top to bottom, use {@link #setSortByPosition(boolean)}.
      * @param doc The document to get the text from.
      * @return The text of the PDF document.
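
The javadoc hint above can be illustrated with a small standalone sketch. It is not PDFBox code and does not show how PDFTextStripper sorts internally; the Fragment class, its top-left coordinates, and both helper methods are hypothetical names chosen for this illustration. It only shows why text read in content stream order can differ from text sorted top to bottom and left to right, which is the behaviour the javadoc says setSortByPosition(boolean) enables.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ReadingOrderSketch {

    /** Hypothetical text fragment: a string plus the position where it is drawn. */
    static final class Fragment {
        final String text;
        final double x; // distance from the left edge of the page (assumed unit)
        final double y; // distance from the top edge of the page (assumed unit)

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins fragments in the order given, standing in for content stream order. */
    static String inStreamOrder(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    /** Joins fragments top to bottom, then left to right within the same y value. */
    static String sortedByPosition(List<Fragment> fragments) {
        List<Fragment> copy = new ArrayList<>(fragments); // leave the input untouched
        copy.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return inStreamOrder(copy);
    }

    public static void main(String[] args) {
        // The producing software is free to emit the fragments in any order.
        List<Fragment> page = Arrays.asList(
                new Fragment("world", 60, 10),
                new Fragment("Second", 10, 30),
                new Fragment("Hello", 10, 10),
                new Fragment("line", 60, 30));

        System.out.println(inStreamOrder(page));    // world Second Hello line
        System.out.println(sortedByPosition(page)); // Hello world Second line
    }
}
```

The sketch compares y values exactly, which is a simplification: text on one visual line rarely shares an identical baseline in real documents, so a practical implementation would need some tolerance when grouping fragments into lines.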
