pdfbox-commits mailing list archives

From msahy...@apache.org
Subject pdfbox-docs git commit: PDFBOX-3330: add FAQ to 2.0; add additional questions
Date Sun, 02 Oct 2016 15:36:05 GMT
Repository: pdfbox-docs
Updated Branches:
  refs/heads/master 493c4026d -> 03016d0b8


PDFBOX-3330: add FAQ to 2.0; add additional questions


Project: http://git-wip-us.apache.org/repos/asf/pdfbox-docs/repo
Commit: http://git-wip-us.apache.org/repos/asf/pdfbox-docs/commit/03016d0b
Tree: http://git-wip-us.apache.org/repos/asf/pdfbox-docs/tree/03016d0b
Diff: http://git-wip-us.apache.org/repos/asf/pdfbox-docs/diff/03016d0b

Branch: refs/heads/master
Commit: 03016d0b823bf0a8833911ec79a48688221acf39
Parents: 493c402
Author: Maruan Sahyoun <sahyoun@fileaffairs.de>
Authored: Sun Oct 2 17:35:56 2016 +0200
Committer: Maruan Sahyoun <sahyoun@fileaffairs.de>
Committed: Sun Oct 2 17:35:56 2016 +0200

----------------------------------------------------------------------
 content/2.0/faq.md            | 189 +++++++++++++++++++++++++++++++++++++
 content/_layouts/default.html |   1 +
 2 files changed, 190 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/pdfbox-docs/blob/03016d0b/content/2.0/faq.md
----------------------------------------------------------------------
diff --git a/content/2.0/faq.md b/content/2.0/faq.md
new file mode 100644
index 0000000..d42750f
--- /dev/null
+++ b/content/2.0/faq.md
@@ -0,0 +1,189 @@
+---
+license: Licensed to the Apache Software Foundation (ASF) under one
+         or more contributor license agreements.  See the NOTICE file
+         distributed with this work for additional information
+         regarding copyright ownership.  The ASF licenses this file
+         to you under the Apache License, Version 2.0 (the
+         "License"); you may not use this file except in compliance
+         with the License.  You may obtain a copy of the License at
+
+           http://www.apache.org/licenses/LICENSE-2.0
+
+         Unless required by applicable law or agreed to in writing,
+         software distributed under the License is distributed on an
+         "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+         KIND, either express or implied.  See the License for the
+         specific language governing permissions and limitations
+         under the License.
+
+layout:  default
+title:   Frequently Asked Questions (FAQ)
+---
+
+# Frequently asked questions
+
+### General Questions
+
+ - [I am getting the below Log4J warning message, how do I remove it?](#log4j)
+ - [Is PDFBox thread safe?](#threadsafe)
+ - [Why do I get a "Warning: You did not close the PDF Document"?](#notclosed)
+
+### Font Handling
+
+ - [I'm getting java.lang.IllegalArgumentException: ... is not available in this font's encoding: WinAnsiEncoding](#fontencoding)
+
+### PDF Creation
+
+ - [I'm creating a PDF but my page is empty. Why?](#emptypage)
+
+### Text Extraction
+
+ - [How come I am not getting any text from the PDF document?](#notext)
+ - [How come I am getting gibberish(G38G43G36G51G5) when extracting text?](#gibberish)
+ - [What does "java.io.IOException: Can't handle font width" mean?](#fontwidth)
+ - [Why do I get "You do not have permission to extract text" on some documents?](#permission)
+ - [Can't we just extract the text without parsing the whole document or extract text as it is parsed?](#partially)
+
+### PDF rendering
+
+ - [A drop shadow is missing or at the wrong position when rendering a page](#dropshadow)
 
+
+## General Questions
+
+<a name="log4j"></a>
+
+### I am getting the below Log4J warning message, how do I remove it? ###
+
+```java
+log4j:WARN No appenders could be found for logger (org.apache.pdfbox.util.ResourceLoader).
+log4j:WARN Please initialize the log4j system properly.
+```
+
+This message means that you need to configure the log4j logging system.
+See the [log4j documentation](http://logging.apache.org/log4j/1.2/manual.html) for more information.
+
+PDFBox comes with a sample log4j configuration file.  To use it, set a system property like this:
+
+```java
+java -Dlog4j.configuration=log4j.xml org.apache.pdfbox.tools.ExtractText <PDF-file> <output-text-file>
+```
+
+If this does not work, you may have to specify the log4j configuration file as a URL, like this:
+
+```java
+log4j.configuration=file:///<path to config file>
+```
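A `file:` URL for the configuration file can also be derived from an ordinary path. The following is a minimal sketch using only the Java standard library; the class name `Log4jConfigUrl` and the default file name `log4j.xml` are illustrative assumptions and are not part of PDFBox or log4j.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class Log4jConfigUrl
{
    /** Converts a (possibly relative) path into an absolute file: URL string. */
    static String toFileUrl( String configPath )
    {
        Path absolute = Paths.get( configPath ).toAbsolutePath().normalize();
        return absolute.toUri().toString();
    }

    public static void main( String[] args )
    {
        // Hypothetical default; pass the real location as the first argument.
        String path = args.length > 0 ? args[0] : "log4j.xml";
        // Prints a value that can be passed to the JVM on the command line.
        System.out.println( "-Dlog4j.configuration=" + toFileUrl( path ) );
    }
}
```

The printed option can then be used in place of the relative file name in the command shown earlier.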
+
+<a name="threadsafe"></a>
+
+### Is PDFBox thread safe? ###
+
+PDFBox has experimental support for *read-only* operations on the same PDDocument from different threads.
+
+For all other use cases, only one thread may access a single document at a time. You can have multiple threads
+each accessing their own PDDocument object.
+
+<a name="notclosed"></a>
+
+### Why do I get a "Warning: You did not close the PDF Document"? ###
+
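+You need to call close() on the PDDocument inside a finally block; if you
+don't, the document will not be closed properly.  You must also close every
+PDDocument object that gets created.  The following code creates **two**
+PDDocument objects, one from "new PDDocument()" and a second from the load method,
+but only the second one is closed.  To avoid this, initialize the variable to null
+instead of creating a new PDDocument.
+
+```java
+import java.io.File;
+import org.apache.pdfbox.pdmodel.PDDocument;
+
+// Problem: this first document is never closed.
+PDDocument doc = new PDDocument();
+try
+{
+   // load() returns a second document and overwrites the only reference to the first.
+   doc = PDDocument.load( new File( "my.pdf" ) );
+}
+finally
+{
+   if( doc != null )
+   {
+      // Closes only the loaded document.
+      doc.close();
+   }
+}
+```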
+
+## Font Handling
+
+<a name="fontencoding"></a>
+
+### I'm getting java.lang.IllegalArgumentException: ... is not available in this font's encoding: WinAnsiEncoding
+
+Check whether the character is available in WinAnsiEncoding by looking at the [PDF Specification](https://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf), Annex D.
+If it is not, but it is available in this font (on Windows, have a look with charmap.exe), then load the font with
+PDType0Font.load(); see also the EmbeddedFonts.java example in the source code download.
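As a rough pre-check, the text can be tested against the JDK's `windows-1252` charset before it is written to the page. This is only a sketch under the assumption that `windows-1252` is a close enough stand-in for WinAnsiEncoding; the two are not identical (control characters such as line breaks are encodable in `windows-1252` but have no glyph in WinAnsiEncoding), so the table in the specification remains the authority. The class name `WinAnsiCheck` is hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

class WinAnsiCheck
{
    // Assumption: windows-1252 approximates WinAnsiEncoding; it is not an exact match.
    private static final Charset CP1252 = Charset.forName( "windows-1252" );

    /** Returns the index of the first char that windows-1252 cannot encode, or -1 if there is none. */
    static int firstUnencodable( String text )
    {
        CharsetEncoder encoder = CP1252.newEncoder();
        for( int i = 0; i < text.length(); i++ )
        {
            if( !encoder.canEncode( text.charAt( i ) ) )
            {
                return i;
            }
        }
        return -1;
    }

    public static void main( String[] args )
    {
        String sample = args.length > 0 ? args[0] : "Resistance: 5 \u03A9";
        int index = firstUnencodable( sample );
        System.out.println( index < 0
                ? "All characters are encodable in windows-1252"
                : "Not encodable at index " + index + ": " + sample.charAt( index ) );
    }
}
```

A character reported here is a candidate for the exception above and a reason to load a font with PDType0Font.load() instead.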
+
+## PDF Creation
+
+<a name="emptypage"></a>
+
+### I'm creating a PDF but my page is empty. Why?
+
+Make sure that you closed your content stream before saving.
+
+## Text Extraction
+
+<a name="notext"></a>
+
+### How come I am not getting any text from the PDF document? ###
+
+Text extraction from a PDF document is a complicated task, and many factors
+affect the possibility and accuracy of text extraction.  It would be helpful
+to the PDFBox team if you could try a couple of things.
+
+ - Open the PDF in Acrobat and try to extract text from there.  If Acrobat can extract text then PDFBox
+should be able to as well and it is a bug if it cannot.  If Acrobat cannot extract text then PDFBox 'probably' cannot either.
+ - It might really be an image instead of text.  Some PDF documents are just images that have been scanned in.
+You can tell by using the selection tool in Acrobat: if you can't select any text then it is probably an image.
+
+<a name="gibberish"></a>
+
+### How come I am getting gibberish(G38G43G36G51G5) when extracting text? ###
+
+This is because the characters in a PDF document can use a custom encoding
+instead of Unicode or ASCII.  When you see gibberish text, it
+probably means that a meaningless internal encoding is being used.  The
+only way to access the text is to use OCR.  This may be a future
+enhancement.
+
+<a name="fontwidth"></a>
+
+### What does "java.io.IOException: Can't handle font width" mean? ###
+
+This probably means that the "Resources" directory is not in your classpath. The
+Resources directory is included in the PDFBox jar so this is only a problem if you
+are building PDFBox yourself and not using the binary.
+
+<a name="permission"></a>
+
+### Why do I get "You do not have permission to extract text" on some documents? ###
+
+PDF documents have certain security permissions that can be applied to them and two
+passwords associated with them, a user password and a master password. If the "cannot extract text"
+permission bit is set then you need to decrypt the document with the master password in order
+to extract the text.
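The permission check itself is a simple bit test. The sketch below assumes the standard security handler, where the permission flags are a 32-bit integer and bit 5 (counting from 1) controls copying or extracting text and graphics, as listed in the PDF specification. The class name `ExtractPermission` and the sample flag values are illustrative only; this is not the PDFBox implementation.

```java
class ExtractPermission
{
    // Bit 5 (1-based) of the permission flags: copy or extract text and graphics.
    static final int EXTRACT_BIT = 5;

    /** Returns true if the given permission flags allow text extraction. */
    static boolean canExtract( int permissionFlags )
    {
        int mask = 1 << ( EXTRACT_BIT - 1 );
        return ( permissionFlags & mask ) != 0;
    }

    public static void main( String[] args )
    {
        // Illustrative flag values, not taken from a real document.
        System.out.println( canExtract( -1 ) );     // all bits set
        System.out.println( canExtract( -3904 ) );  // bits 3 to 6 cleared
    }
}
```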
+
+<a name="partially"></a>
+
+### Can't we just extract the text without parsing the whole document or extract text as it is parsed? ###
+
+Not really, for a couple of reasons.
+
+ - If the document is encrypted then you need to parse at least until the encryption dictionary before
+you can decrypt.
+ - Sometimes the PDFont contains vital information needed for text extraction.
+ - Text on a page does not have to be drawn in reading order. For example: if the page said "Hello World",
+the PDF could have been written such that "World" gets drawn and then the cursor moves to the left and
+the word "Hello" is drawn.
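The last point can be illustrated with a small sketch: fragments are collected in drawing order and can only be put into reading order once all of them are known. The `Fragment` type and the coordinates are hypothetical and much simpler than what a real text extractor tracks; the sketch assumes PDF user space, where larger y values are closer to the top of the page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ReadingOrder
{
    /** Hypothetical text fragment: a string plus the position where it was drawn. */
    static final class Fragment
    {
        final String text;
        final double x;
        final double y;

        Fragment( String text, double x, double y )
        {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Sorts fragments top to bottom, then left to right, and joins them with spaces. */
    static String inReadingOrder( List<Fragment> drawn )
    {
        List<Fragment> sorted = new ArrayList<>( drawn );
        sorted.sort( Comparator.comparingDouble( (Fragment f) -> -f.y )
                               .thenComparingDouble( f -> f.x ) );
        StringBuilder result = new StringBuilder();
        for( Fragment f : sorted )
        {
            if( result.length() > 0 )
            {
                result.append( ' ' );
            }
            result.append( f.text );
        }
        return result.toString();
    }

    public static void main( String[] args )
    {
        // "World" is drawn first, then the cursor moves left and "Hello" is drawn.
        List<Fragment> drawn = Arrays.asList(
                new Fragment( "World", 150, 700 ),
                new Fragment( "Hello", 100, 700 ) );
        System.out.println( inReadingOrder( drawn ) );
    }
}
```

Because the sort needs every fragment on the page, the text cannot simply be emitted as each drawing operator is parsed.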
+
+## PDF rendering
+
+<a name="dropshadow"></a>
+
+### A drop shadow is missing or at the wrong position when rendering a page
+
+Please attach your file to the [PDFBOX-3000](https://issues.apache.org/jira/browse/PDFBOX-3000) issue.

http://git-wip-us.apache.org/repos/asf/pdfbox-docs/blob/03016d0b/content/_layouts/default.html
----------------------------------------------------------------------
diff --git a/content/_layouts/default.html b/content/_layouts/default.html
index 7d168c3..74bd641 100644
--- a/content/_layouts/default.html
+++ b/content/_layouts/default.html
@@ -90,6 +90,7 @@
                                 </ul>
                             </li>
                             <li><a href="/2.0/commandline.html">Command Line Tools</a></li>
+                            <li><a href="/2.0/faq.html">FAQ</a></li>
                             <li><a href="/docs/2.0.3/javadocs/">API Docs</a></li>
                         </ul>
                     </li>

