pdfbox-commits mailing list archives

From le...@apache.org
Subject svn commit: r1481535 - /pdfbox/cmssite/trunk/content/userguide/faq.mdtext
Date Sun, 12 May 2013 12:13:00 GMT
Author: lehmi
Date: Sun May 12 12:13:00 2013
New Revision: 1481535

URL: http://svn.apache.org/r1481535
added FAQ page

    pdfbox/cmssite/trunk/content/userguide/faq.mdtext   (with props)

Added: pdfbox/cmssite/trunk/content/userguide/faq.mdtext
URL: http://svn.apache.org/viewvc/pdfbox/cmssite/trunk/content/userguide/faq.mdtext?rev=1481535&view=auto
--- pdfbox/cmssite/trunk/content/userguide/faq.mdtext (added)
+++ pdfbox/cmssite/trunk/content/userguide/faq.mdtext Sun May 12 12:13:00 2013
@@ -0,0 +1,137 @@
+Title: Frequently Asked Questions
+Notice:    Licensed to the Apache Software Foundation (ASF) under one
+           or more contributor license agreements.  See the NOTICE file
+           distributed with this work for additional information
+           regarding copyright ownership.  The ASF licenses this file
+           to you under the Apache License, Version 2.0 (the
+           "License"); you may not use this file except in compliance
+           with the License.  You may obtain a copy of the License at
+           .
+             http://www.apache.org/licenses/LICENSE-2.0
+           .
+           Unless required by applicable law or agreed to in writing,
+           software distributed under the License is distributed on an
+           "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+           KIND, either express or implied.  See the License for the
+           specific language governing permissions and limitations
+           under the License.
+# FAQ
+## General Questions
+ - [When will the next version of PDFBox be released?](#releaseplan)
+ - [I am getting the below Log4J warning message, how do I remove it?](#log4j)
+ - [Is PDFBox thread safe?](#threadsafe)
+ - [Why do I get a "Warning: You did not close the PDF Document"?](#notclosed)
+## Text Extraction
+ - [How come I am not getting any text from the PDF document?](#notext)
+ - [How come I am getting gibberish (G38G43G36G51G5) when extracting text?](#gibberish)
+ - [What does "java.io.IOException: Can't handle font width" mean?](#fontwidth)
+ - [Why do I get "You do not have permission to extract text" on some documents?](#permission)
+ - [Can't we just extract the text without parsing the whole document or extract text as it is parsed?](#partially)
+# Answers
+## General Questions
+### When will the next version of PDFBox be released? ### {#releaseplan}
+As fixes are made and integrated into the repository, these changes are documented in the
+[release notes](http://pdfbox.apache.org/downloads.html). An estimate will be given of when the next version will be released.
+Of course, this is only an estimate and could change.
+### I am getting the below Log4J warning message, how do I remove it? ### {#log4j}
+	:::java
+	log4j:WARN No appenders could be found for logger (org.apache.pdfbox.util.ResourceLoader).
+	log4j:WARN Please initialize the log4j system properly.
+This message means that you need to configure the log4j logging system.
+See the [log4j documentation](http://logging.apache.org/log4j/docs/documentation.html) for more information.
+PDFBox comes with a sample log4j configuration file.  To use it, set a system property like this:
+	:::java
+        java -Dlog4j.configuration=log4j.xml org.apache.pdfbox.ExtractText <PDF-file>
+If this does not work for you, you may have to specify the log4j config file using a URL path, like this:
+        :::java
+        log4j.configuration=file:///<path to config file>
+Please see [this](https://sourceforge.net/forum/forum.php?thread_id=1254229&amp;forum_id=267205) forum thread
+for more information.
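If passing the property on the command line is inconvenient, the same setting can be applied from code. The following is a minimal sketch, assuming log4j 1.x reads the `log4j.configuration` system property when it first initializes; the class name `Log4jConfigProperty` and the path `log4j.xml` are placeholders, not part of PDFBox. `File.toURI()` yields the `file:/...` form, which denotes the same location as `file:///...`.

```java
import java.io.File;

final class Log4jConfigProperty {
    // Converts a local path (hypothetical here) into its file: URL form and
    // stores it in the log4j.configuration system property.
    static String configure(String configPath) {
        String url = new File(configPath).toURI().toString();
        System.setProperty("log4j.configuration", url);
        return url;
    }

    public static void main(String[] args) {
        // Call this before the first logger is created, otherwise log4j has
        // already looked for its configuration.
        System.out.println(configure("log4j.xml"));
    }
}
```

The call has to happen early in `main`, before any PDFBox class that logs is touched.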
+### Is PDFBox thread safe? ### {#threadsafe}
+No! Only one thread may access a single document at a time. You can have multiple threads
+each accessing their own PDDocument object.
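One way to respect this rule is to create, use and close the document entirely inside a single task, so that no reference ever crosses a thread boundary. The sketch below illustrates only that structure; `StandInDocument` is a hypothetical stand-in for PDDocument so that the example runs without PDFBox on the classpath, and its methods are not PDFBox API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class OneDocumentPerThread {
    // Hypothetical stand-in for PDDocument; not part of PDFBox.
    static final class StandInDocument {
        private final String name;
        StandInDocument(String name) { this.name = name; }
        String extractText() { return "text of " + name; }
        void close() { }
    }

    static List<String> extractAll(List<String> names) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String name : names) {
                pending.add(pool.submit(() -> {
                    // Opened, used and closed within one task: the document
                    // is never visible to any other thread.
                    StandInDocument doc = new StandInDocument(name);
                    try {
                        return doc.extractText();
                    } finally {
                        doc.close();
                    }
                }));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : pending) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(extractAll(Arrays.asList("a.pdf", "b.pdf")));
    }
}
```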
+### Why do I get a "Warning: You did not close the PDF Document"? ### {#notclosed}
+You need to call close() on the PDDocument inside a finally block; if you
+don't, the document will not be closed properly.  Also, you must close all
+PDDocument objects that get created.  The following code creates **two**
+PDDocument objects: one from "new PDDocument()" and a second from the load method.
+Only the second one is closed; the first is overwritten and never closed, which triggers the warning.
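+	:::java
+	// first PDDocument is created here
+	PDDocument doc = new PDDocument();
+	try
+	{
+	   // second PDDocument replaces the first, which is never closed
+	   doc = PDDocument.load( "my.pdf" );
+	}
+	finally
+	{
+	   if( doc != null )
+	   {
+	      doc.close();
+	   }
+	}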
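The usual remedy is to start from `null` so that the load call creates the only document. The sketch below contrasts the two variants; `StandInDocument` is a hypothetical stand-in that merely counts unclosed instances, used here so the example runs without PDFBox. With the real library the same shape would apply to PDDocument, though the exact load call depends on the PDFBox version.

```java
final class CloseEveryDocument {
    // Hypothetical stand-in for PDDocument that counts unclosed instances.
    static final class StandInDocument {
        static int open = 0;
        StandInDocument() { open++; }
        static StandInDocument load(String file) { return new StandInDocument(); }
        void close() { open--; }
    }

    // Same shape as the FAQ snippet: two documents created, one closed.
    static int leaky() {
        StandInDocument.open = 0;
        StandInDocument doc = new StandInDocument();
        try {
            doc = StandInDocument.load("my.pdf");
        } finally {
            if (doc != null) {
                doc.close();
            }
        }
        return StandInDocument.open;
    }

    // Starts from null, so load() creates the only document.
    static int fixed() {
        StandInDocument.open = 0;
        StandInDocument doc = null;
        try {
            doc = StandInDocument.load("my.pdf");
        } finally {
            if (doc != null) {
                doc.close();
            }
        }
        return StandInDocument.open;
    }

    public static void main(String[] args) {
        System.out.println("leaky leaves open: " + leaky()); // 1
        System.out.println("fixed leaves open: " + fixed()); // 0
    }
}
```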
+## Text Extraction
+### How come I am not getting any text from the PDF document? ### {#notext}
+Text extraction from a PDF document is a complicated task and there are many factors
+involved that affect the possibility and accuracy of text extraction.  It would be helpful
+to the PDFBox team if you could try a couple of things.
+ - Open the PDF in Acrobat and try to extract text from there.  If Acrobat can extract text then PDFBox
+should be able to as well and it is a bug if it cannot.  If Acrobat cannot extract text then PDFBox 'probably' cannot either.
+ - It might really be an image instead of text.  Some PDF documents are just images that have been scanned in.
+You can tell by using the selection tool in Acrobat: if you can't select any text then it is probably an image.
+### How come I am getting gibberish (G38G43G36G51G5) when extracting text? ### {#gibberish}
+This is because the characters in a PDF document can use a custom encoding
+instead of Unicode or ASCII.  When you see gibberish text then it
+probably means that a meaningless internal encoding is being used.  The
+only way to access the text is to use OCR.  This may be a future enhancement.
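To make the idea concrete, here is a toy model of the problem, not PDFBox's actual decoding logic. It assumes glyph codes are translated through a code-to-Unicode table; when the table has no entry for a code, the only thing left to print is a placeholder built from the code itself. The class name, the table contents and the `G` prefix are illustrative assumptions.

```java
import java.util.HashMap;
import java.util.Map;

final class CustomEncodingSketch {
    // Decodes glyph codes with the given code-to-Unicode table. Codes without
    // an entry cannot be turned into characters, so a placeholder is emitted.
    static String decode(int[] codes, Map<Integer, String> toUnicode) {
        StringBuilder out = new StringBuilder();
        for (int code : codes) {
            String ch = toUnicode.get(code);
            out.append(ch != null ? ch : "G" + code);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        int[] codes = {1, 2, 3};
        Map<Integer, String> table = new HashMap<Integer, String>();
        table.put(1, "P");
        table.put(2, "D");
        table.put(3, "F");
        System.out.println(decode(codes, table));                          // PDF
        System.out.println(decode(codes, new HashMap<Integer, String>())); // G1G2G3
    }
}
```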
+### What does "java.io.IOException: Can't handle font width" mean? ### {#fontwidth}
+This probably means that the "Resources" directory is not in your classpath. The
+Resources directory is included in the PDFBox jar so this is only a problem if you
+are building PDFBox yourself and not using the binary.
+### Why do I get "You do not have permission to extract text" on some documents? ### {#permission}
+PDF documents have certain security permissions that can be applied to them and two
+passwords associated with them, a user password and a master password. If the "cannot extract"
+permission bit is set, then you need to decrypt the document with the master password in order
+to extract the text.
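For illustration, the permission check comes down to testing one bit of an integer. The sketch below assumes the layout used by the PDF specification's standard security handler, where the permissions value has bit 5 (counting from 1 at the low-order end) governing copying and extraction of text, and where a set bit grants the permission. The class and method names are hypothetical and this is not PDFBox API.

```java
final class ExtractPermissionBit {
    // Bit 5, counted from 1 at the low-order end, i.e. mask 1 << 4.
    // Assumption: layout of the standard security handler's permissions value.
    static final int EXTRACT_BIT = 1 << 4;

    // True when the extraction bit is set, meaning extraction is permitted.
    static boolean canExtract(int permissions) {
        return (permissions & EXTRACT_BIT) != 0;
    }

    public static void main(String[] args) {
        System.out.println(canExtract(-4));    // true: 0xFFFFFFFC has bit 5 set
        System.out.println(canExtract(-3904)); // false: 0xFFFFF0C0 has bit 5 clear
    }
}
```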
+### Can't we just extract the text without parsing the whole document or extract text as it is parsed? ### {#partially}
+Not really, for a couple of reasons.
+ - If the document is encrypted then you need to parse at least as far as the encryption dictionary
+before you can decrypt.
+ - Sometimes the PDFont contains vital information needed for text extraction.
+ - Text on a page does not have to be drawn in reading order. For example: if the page said "Hello World",
+the PDF could have been written such that "World" gets drawn and then the cursor moves to the left and
+the word "Hello" is drawn.
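The reading-order point can be sketched in a few lines. This is a simplified illustration, not how PDFBox sorts text: it assumes each drawn fragment carries an x and y position in PDF user space (origin at the bottom left) and that all fragments of a page are available before ordering, which is why the whole page has to be parsed first. The `Fragment` type and the coordinates are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class ReadingOrderSketch {
    // Hypothetical text fragment with the position at which it was drawn.
    static final class Fragment {
        final String text;
        final double x;
        final double y;
        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Orders fragments top to bottom (higher y first), then left to right.
    static String inReadingOrder(List<Fragment> drawn) {
        List<Fragment> sorted = new ArrayList<>(drawn);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> -f.y)
                .thenComparingDouble(f -> f.x));
        StringBuilder out = new StringBuilder();
        for (Fragment f : sorted) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(f.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "World" is drawn first, then the cursor moves left and "Hello" is drawn.
        List<Fragment> drawn = Arrays.asList(
                new Fragment("World", 150, 700),
                new Fragment("Hello", 100, 700));
        System.out.println(inReadingOrder(drawn)); // Hello World
    }
}
```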

Propchange: pdfbox/cmssite/trunk/content/userguide/faq.mdtext
    svn:eol-style = native
