pdfbox-commits mailing list archives

From msahy...@apache.org
Subject pdfbox-docs git commit: Site checkin for project Apache PDFBox Website
Date Sun, 11 Dec 2016 09:05:18 GMT
Repository: pdfbox-docs
Updated Branches:
  refs/heads/asf-site 6c0161b83 -> a1993c448


Site checkin for project Apache PDFBox Website


Project: http://git-wip-us.apache.org/repos/asf/pdfbox-docs/repo
Commit: http://git-wip-us.apache.org/repos/asf/pdfbox-docs/commit/a1993c44
Tree: http://git-wip-us.apache.org/repos/asf/pdfbox-docs/tree/a1993c44
Diff: http://git-wip-us.apache.org/repos/asf/pdfbox-docs/diff/a1993c44

Branch: refs/heads/asf-site
Commit: a1993c4484ce3367fffb9f05466be4ff7a30ef5c
Parents: 6c0161b
Author: Maruan Sahyoun <sahyoun@fileaffairs.de>
Authored: Sun Dec 11 10:05:15 2016 +0100
Committer: Maruan Sahyoun <sahyoun@fileaffairs.de>
Committed: Sun Dec 11 10:05:15 2016 +0100

----------------------------------------------------------------------
 content/2.0/faq.html | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/pdfbox-docs/blob/a1993c44/content/2.0/faq.html
----------------------------------------------------------------------
diff --git a/content/2.0/faq.html b/content/2.0/faq.html
index c12d217..2c50285 100644
--- a/content/2.0/faq.html
+++ b/content/2.0/faq.html
@@ -156,6 +156,7 @@
 <h3 id="text-extraction">Text Extraction</h3>
 
 <ul>
+  <li><a href="#textorder">Why does the extracted text appear in the wrong sequence?</a></li>
   <li><a href="#notext">How come I am not getting any text from the PDF document?</a></li>
   <li><a href="#gibberish">How come I am getting gibberish(G38G43G36G51G5) when
extracting text?</a></li>
   <li><a href="#fontwidth">What does “java.io.IOException: Can’t handle font
width” mean?</a></li>
@@ -167,6 +168,7 @@
 
 <ul>
   <li><a href="#dropshadow">A drop shadow is missing or at the wrong position
when rendering a page</a></li>
+  <li><a href="#textantialias">Why are some texts in poor quality and not antialiased?</a></li>
 </ul>
 
 <h2 id="general-questions-1">General Questions</h2>
@@ -248,6 +250,15 @@ PDType0Font.load(), see also in the EmbeddedFonts.java example in the
source cod
 
 <h2 id="text-extraction-1">Text Extraction</h2>
 
+<p><a name="textorder"></a></p>
+
+<h3 id="why-does-the-extracted-text-appear-in-the-wrong-sequence">Why does the extracted
text appear in the wrong sequence?</h3>
+
+<p>By default, text extraction is done in the same sequence as the text in the PDF
page content stream.
+PDF is a graphic format, not a text format, and unlike HTML, it has no requirement that
text on a page
+be rendered in a certain order. The order is the one that was determined by the software
that created the PDF.
+To get text sorted from left to right and top to bottom, use <code class="highlighter-rouge">setSortByPosition(true)</code>.</p>
+
 <p><a name="notext"></a></p>
 
 <h3 id="how-come-i-am-not-getting-any-text-from-the-pdf-document">How come I am not
getting any text from the PDF document?</h3>
@@ -311,7 +322,17 @@ the word “Hello” is drawn.</li>
 
 <h3 id="a-drop-shadow-is-missing-or-at-the-wrong-position-when-rendering-a-page">A
drop shadow is missing or at the wrong position when rendering a page</h3>
 
-<p>Please attach your file in the <a href="https://issues.apache.org/jira/browse/PDFBOX-3000">PDFBOX-3000</a>
issue</p>
+<p>Please attach your file in the <a href="https://issues.apache.org/jira/browse/PDFBOX-3000">PDFBOX-3000</a>
issue.</p>
+
+<p><a name="textantialias"></a></p>
+
+<h3 id="why-are-some-texts-in-poor-quality-and-not-antialiased">Why are some texts
in poor quality and not antialiased?</h3>
+
+<p>This is because in some PDFs (e.g. the one in PDFBOX-2814 <a href="https://issues.apache.org/jira/browse/PDFBOX-2814">https://issues.apache.org/jira/browse/PDFBOX-2814</a>),
text is not
+rendered directly, but as a shaped clipping from a background. Java graphics does not support
“soft clipping”
+<a href="https://bugs.openjdk.java.net/browse/JDK-4212743">https://bugs.openjdk.java.net/browse/JDK-4212743</a>,
and because of that, the edges do not look smooth.
+Soft clipping could be achieved with some extra steps <a href="https://community.oracle.com/blogs/campbell/2006/07/19/java-2d-trickery-soft-clipping">https://community.oracle.com/blogs/campbell/2006/07/19/java-2d-trickery-soft-clipping</a>,
+but these would cost additional time and memory. You can get higher quality by rendering
at a higher dpi and then downscaling the image.</p>
 
             </div>
         </div>
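
The new "textorder" FAQ entry contrasts content-stream order with position order. The following is only a conceptual sketch of that difference using plain JDK classes; it is not how PDFBox implements sorting. The `Fragment` class, its coordinates (y assumed to grow downward from the top of the page) and the sample strings are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TextOrderSketch {

    /** Hypothetical text fragment: a string plus the position where it is drawn. */
    static final class Fragment {
        final String text;
        final double x;
        final double y; // assumption: y grows downward from the top of the page

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Returns a copy ordered top to bottom, then left to right; the input list is untouched. */
    static List<Fragment> sortByPosition(List<Fragment> streamOrder) {
        List<Fragment> sorted = new ArrayList<>(streamOrder);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return sorted;
    }

    /** Joins the fragment texts with single spaces, in list order. */
    static String join(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Order in which a (made-up) producer wrote the text into the content stream.
        List<Fragment> streamOrder = new ArrayList<>();
        streamOrder.add(new Fragment("world", 120, 50));
        streamOrder.add(new Fragment("Footer", 10, 700));
        streamOrder.add(new Fragment("Hello", 10, 50));

        System.out.println(join(streamOrder));                 // world Footer Hello
        System.out.println(join(sortByPosition(streamOrder))); // Hello world Footer
    }
}
```

With PDFBox 2.0 itself, the equivalent step is calling `setSortByPosition(true)` on a `PDFTextStripper` before `getText(document)`.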

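The "textantialias" entry suggests rendering at a higher dpi and then downscaling. A minimal sketch of the downscaling half, using only Java 2D, might look like this. The class name, the integer `factor` parameter and the choice of bilinear interpolation are assumptions made for illustration; the input image would typically come from `PDFRenderer.renderImageWithDPI(pageIndex, dpi)` called with a multiple of the target dpi.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

public class DownscaleSketch {

    /**
     * Shrinks an image by an integer factor with bilinear interpolation.
     * Assumption: src was rendered at (factor * target dpi).
     */
    static BufferedImage downscale(BufferedImage src, int factor) {
        if (factor < 1) {
            throw new IllegalArgumentException("factor must be >= 1");
        }
        int w = Math.max(1, src.getWidth() / factor);
        int h = Math.max(1, src.getHeight() / factor);
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            // Interpolating between neighbouring source pixels smooths the hard clip edges.
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return dst;
    }

    public static void main(String[] args) {
        // Stand-in for a page rendered at twice the target resolution.
        BufferedImage rendered = new BufferedImage(400, 200, BufferedImage.TYPE_INT_RGB);
        BufferedImage result = downscale(rendered, 2);
        System.out.println(result.getWidth() + "x" + result.getHeight()); // 200x100
    }
}
```

As the FAQ text notes for soft clipping, this also costs extra time and memory, since the intermediate image is `factor * factor` times larger than the final one.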
