pdfbox-commits mailing list archives

From build...@apache.org
Subject svn commit: r861708 - in /websites/staging/pdfbox/trunk/content: ./ userguide/ userguide/faq.html
Date Sun, 12 May 2013 12:13:05 GMT
Author: buildbot
Date: Sun May 12 12:13:04 2013
New Revision: 861708

Log:
Staging update by buildbot for pdfbox

Added:
    websites/staging/pdfbox/trunk/content/userguide/
    websites/staging/pdfbox/trunk/content/userguide/faq.html
Modified:
    websites/staging/pdfbox/trunk/content/   (props changed)

Propchange: websites/staging/pdfbox/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Sun May 12 12:13:04 2013
@@ -1 +1 @@
-1481522
+1481535

Added: websites/staging/pdfbox/trunk/content/userguide/faq.html
==============================================================================
--- websites/staging/pdfbox/trunk/content/userguide/faq.html (added)
+++ websites/staging/pdfbox/trunk/content/userguide/faq.html Sun May 12 12:13:04 2013
@@ -0,0 +1,265 @@
+<!DOCTYPE html>
+<html lang="en">
+    
+    <!--
+     
+     Licensed to the Apache Software Foundation (ASF) under one or more
+     contributor license agreements.  See the NOTICE file distributed with
+     this work for additional information regarding copyright ownership.
+     The ASF licenses this file to You under the Apache License, Version 2.0
+     (the "License"); you may not use this file except in compliance with
+     the License.  You may obtain a copy of the License at
+     
+     http://www.apache.org/licenses/LICENSE-2.0
+     
+     Unless required by applicable law or agreed to in writing, software
+     distributed under the License is distributed on an "AS IS" BASIS,
+     WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+     See the License for the specific language governing permissions and
+     limitations under the License.
+     -->
+    
+  <head>
+    <title>Apache PDFBox | Frequently Asked Questions</title>
+
+    <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
+
+    <link href="/bootstrap/css/bootstrap.css" rel="stylesheet">
+    <link href="/bootstrap/css/bootstrap-responsive.css" rel="stylesheet">
+    <link href="/FontAwesome/css/font-awesome.css" rel="stylesheet">
+    <link href="/Iconic/iconic fill/iconic_fill.css" rel="stylesheet">
+    <link href="/css/pygments-github.css" rel="stylesheet">
+    <link href="/css/site.css" rel="stylesheet">
+        
+        
+
+    
+
+    
+    <!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor
license agreements.  See the NOTICE file distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file to you under the Apache License,
Version 2.0 (the &quot;License&quot;); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at . http://www.apache.org/licenses/LICENSE-2.0
. Unless required by applicable law or agreed to in writing, software distributed under the
License is distributed on an &quot;AS IS&quot; BASIS, WITHOUT WARRANTIES OR CONDITIONS
OF ANY KIND, either express or implied.  See the License for the specific language governing
permissions and limitations under the License. -->
+        <!-- Twitter Bootstrap and jQuery after this line. -->
+        <script src="http://code.jquery.com/jquery-latest.js"></script>
+        <script src="/bootstrap/js/bootstrap.js"></script>
+        <script>
+            $('.nav-collapse').collapse();
+        </script>
+  </head>
+  <body>
+
+    <div class="navbar navbar-fixed-top">
+      <div class="navbar-inner">
+          <a href="index.html"><img class="logo" src="/images/logo-head.gif"></a>
+      </div>
+    </div>
+
+    <header class="main" id="overview">
+        <div class="container">
+        </div>
+    </header>
+
+    <div class="container-fluid">
+        <div class="row-fluid">
+            <div class="span3">
+                <ul class="nav nav-list">
+                    <li class="nav-header">Apache PDFBox</li>
+                    <li><a href="/downloads.html">
+                        <i class="icon-chevron-right"></i>
+                    Downloads</a></li>
+                    <li><a href="/dependencies.html">
+                        <i class="icon-chevron-right"></i>
+                    Dependencies</a></li>
+                    <li><a href="/references.html">
+                        <i class="icon-chevron-right"></i>
+                        References</a></li>
+                <li class="nav-header">Community</li>
+                <li><a href="/support.html">
+                    <i class="icon-chevron-right"></i>
+                    Support
+                </a></li>
+                <li><a href="/mailinglists.html">
+                    <i class="icon-chevron-right"></i>
+                    Mailing Lists
+                </a></li>
+                <li><a href="/team.html">
+                    <i class="icon-chevron-right"></i>
+                    Project Team</a></li>
+                <li  class="nav-header">Documentation</li>
+                <li><a href="/architecture.html">
+                    <i class="icon-chevron-right"></i>
+                    Architecture</a></li>
+                <li><a href="/commandline/">
+                    <i class="icon-chevron-right"></i>
+                    Command Line Tools</a></li>
+                <li class="dropdown"><a  class="dropdown-toggle" data-toggle="dropdown"
href="#">
+                    <i class="icon-chevron-right"></i>
+                    PDFBox Cookbook <b class="caret"></b></a>
+                    <ul class="dropdown-menu">
+                        <li><a href="/cookbook/documentcreation.html">
+                            <i class="icon-chevron-right"></i>
+                            Document Creation</a>
+                        </li>
+                        <li><a href="/cookbook/textextraction.html">
+                            <i class="icon-chevron-right"></i>
+                            Text Extraction</a>
+                        </li>
+                        <li><a href="/cookbook/pdfavalidation.html">
+                            <i class="icon-chevron-right"></i>
+                            PDF/A Validation</a>
+                        </li>
+                        <li><a href="/cookbook/workingwithfonts.html">
+                            <i class="icon-chevron-right"></i>
+                            Working with Fonts</a>
+                        </li>
+                        <li><a href="/cookbook/workingwithmetadata.html">
+                            <i class="icon-chevron-right"></i>
+                            Working with Metadata</a>
+                        </li>
+                        <li><a href="/cookbook/workingwithattachments.html">
+                            <i class="icon-chevron-right"></i>
+                            Working with Attachments</a>
+                        </li>
+                    </ul>
+                </li>
+                <li  class="nav-header">For Developers</li>
+                <li><a href="/building.html">
+                    <i class="icon-chevron-right"></i>
+                    Building PDFBox</a></li>
+                <li><a href="/ideas.html">
+                    <i class="icon-chevron-right"></i>
+                    Ideas</a></li>
+                <li><a href="/codingconventions.html">
+                    <i class="icon-chevron-right"></i>
+                    Coding Conventions</a></li>
+                <li  class="nav-header">Apache Software Foundation</li>
+                <li><a href="http://www.apache.org/">
+                    <i class="icon-chevron-right"></i>
+                    Apache Software Foundation</a></li>
+                <li><a href="http://www.apache.org/foundation/thanks.html">
+                    <i class="icon-chevron-right"></i>
+                    ASF Sponsors</a></li>
+                <li><a href="http://www.apache.org/security/">
+                    <i class="icon-chevron-right"></i>
+                    Security</a></li>
+                </ul>
+            </div>
+            <div class="span9">
+                 <h1 id="faq">FAQ</h1>
+<h2 id="general-questions">General Questions</h2>
+<ul>
+<li><a href="#releaseplan">When will the next version of PDFBox be released?</a></li>
+<li><a href="#log4j">I am getting the below Log4J warning message, how do I remove
it?</a></li>
+<li><a href="#threadsafe">Is PDFBox thread safe?</a></li>
+<li><a href="#notclosed">Why do I get a "Warning: You did not close the PDF Document"?</a></li>
+</ul>
+<h2 id="text-extraction">Text Extraction</h2>
+<ul>
+<li><a href="#notext">How come I am not getting any text from the PDF document?</a></li>
+<li><a href="#gibberish">How come I am getting gibberish (G38G43G36G51G5) when extracting text?</a></li>
+<li><a href="#fontwidth">What does "java.io.IOException: Can't handle font width"
mean?</a></li>
+<li><a href="#permission">Why do I get "You do not have permission to extract
text" on some documents?</a></li>
+<li><a href="#partially">Can't we just extract the text without parsing the whole
document or extract text as it is parsed?</a></li>
+</ul>
+<h1 id="answers">Answers</h1>
+<h2 id="general-questions_1">General Questions</h2>
+<h3 id="releaseplan">When will the next version of PDFBox be released?</h3>
+<p>As fixes are made and integrated into the repository, they are documented in the
+<a href="http://pdfbox.apache.org/downloads.html">release notes</a>, together with an estimate of when the next version will be released.
+Of course, this is only an estimate and could change.</p>
+<h3 id="log4j">I am getting the below Log4J warning message, how do I remove it?</h3>
+<div class="codehilite"><pre><span class="nl">log4j:</span><span
class="n">WARN</span> <span class="n">No</span> <span class="n">appenders</span>
<span class="n">could</span> <span class="n">be</span> <span class="n">found</span>
<span class="k">for</span> <span class="n">logger</span> <span
class="o">(</span><span class="n">org</span><span class="o">.</span><span
class="na">apache</span><span class="o">.</span><span class="na">pdfbox</span><span
class="o">.</span><span class="na">util</span><span class="o">.</span><span
class="na">ResourceLoader</span><span class="o">).</span>
+<span class="nl">log4j:</span><span class="n">WARN</span> <span
class="n">Please</span> <span class="n">initialize</span> <span class="n">the</span>
<span class="n">log4j</span> <span class="n">system</span> <span
class="n">properly</span><span class="o">.</span>
+</pre></div>
+
+
+<p>This message means that you need to configure the log4j logging system.
+See the <a href="http://logging.apache.org/log4j/docs/documentation.html">log4j documentation</a>
for more information.</p>
+<p>PDFBox comes with a sample log4j configuration file. To use it, set a system property like this:</p>
+<div class="codehilite"><pre>    <span class="n">java</span> <span
class="o">-</span><span class="n">Dlog4j</span><span class="o">.</span><span
class="na">configuration</span><span class="o">=</span><span class="n">log4j</span><span
class="o">.</span><span class="na">xml</span> <span class="n">org</span><span
class="o">.</span><span class="na">apache</span><span class="o">.</span><span
class="na">pdfbox</span><span class="o">.</span><span class="na">ExtractText</span>
<span class="o">&lt;</span><span class="n">PDF</span><span
class="o">-</span><span class="n">file</span><span class="o">&gt;</span>
<span class="o">&lt;</span><span class="n">output</span><span
class="o">-</span><span class="n">text</span><span class="o">-</span><span
class="n">file</span><span class="o">&gt;</span>
+</pre></div>
+
+
+<p>If this does not work for you, you may have to specify the log4j config file using a URL path, like this:</p>
+<div class="codehilite"><pre>    <span class="n">log4j</span><span
class="o">.</span><span class="na">configuration</span><span class="o">=</span><span
class="nl">file:</span><span class="c1">///&lt;path to config file&gt;</span>
+</pre></div>
+
+
+<p>Please see <a href="https://sourceforge.net/forum/forum.php?thread_id=1254229&amp;forum_id=267205">this</a>
forum thread 
+for more information.</p>
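+<p>As a rough sketch of the URL form above, a file path can be turned into a <code>file:///</code> URL with the JDK alone. The class name <code>Log4jConfigUrl</code>, the helper <code>toFileUrl</code> and the default file name are assumptions made for illustration and are not part of PDFBox. Setting the property from code is only expected to help if it happens before log4j 1.x initializes itself.</p>

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative helper, not part of PDFBox or log4j.
class Log4jConfigUrl {

    /** Converts a (possibly relative) file path into an absolute file: URL string. */
    static String toFileUrl(String configPath) {
        Path absolute = Paths.get(configPath).toAbsolutePath().normalize();
        // Path.toUri() yields the file:///<path> form used in the FAQ.
        return absolute.toUri().toString();
    }

    public static void main(String[] args) {
        // "log4j.xml" is an assumed default; pass the real location as the first argument.
        String url = toFileUrl(args.length > 0 ? args[0] : "log4j.xml");

        // Equivalent to passing -Dlog4j.configuration=<url> on the command line,
        // provided this runs before any log4j class has been loaded.
        System.setProperty("log4j.configuration", url);
        System.out.println("log4j.configuration=" + url);
    }
}
```

+<p>Passing the same value with <code>-Dlog4j.configuration=...</code> on the command line avoids any dependence on class loading order.</p>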
+<h3 id="threadsafe">Is PDFBox thread safe?</h3>
+<p>No! Only one thread may access a single document at a time. You can have multiple threads,
+each accessing its own PDDocument object.</p>
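+<p>The one-document-per-thread rule could be sketched as follows. To keep the example runnable without PDFBox, <code>ConfinedDoc</code> is a hypothetical stand-in for PDDocument that rejects access from any thread other than the one that created it; the pool size and method names are likewise assumptions.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class OneDocumentPerThread {

    /** Hypothetical stand-in for PDDocument: usable only by the thread that created it. */
    static final class ConfinedDoc {
        private final Thread owner = Thread.currentThread();
        private int pagesRead = 0;

        int readPage() {
            if (Thread.currentThread() != owner) {
                throw new IllegalStateException("document shared across threads");
            }
            return ++pagesRead;
        }
    }

    /** Processes several documents in parallel; returns the total number of pages read. */
    static int processAll(int documents, int pagesPerDocument) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int i = 0; i < documents; i++) {
                results.add(pool.submit(() -> {
                    // Each task creates its own document and never hands it to another thread.
                    ConfinedDoc doc = new ConfinedDoc();
                    int last = 0;
                    for (int p = 0; p < pagesPerDocument; p++) {
                        last = doc.readPage();
                    }
                    return last;
                }));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get();
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(processAll(8, 3));
    }
}
```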
+<h3 id="notclosed">Why do I get a "Warning: You did not close the PDF Document"?</h3>
+<p>You need to call close() on the PDDocument inside a finally block; if you
+don't, the document will not be closed properly.  Also, you must close all
+PDDocument objects that get created.  The following code creates <strong>two</strong>
+PDDocument objects, one from the "new PDDocument()" and the second by the load method, but closes only one of them.</p>
+<div class="codehilite"><pre><span class="n">PDDocument</span> <span
class="n">doc</span> <span class="o">=</span> <span class="k">new</span>
<span class="n">PDDocument</span><span class="o">();</span>
+<span class="k">try</span>
+<span class="o">{</span>
+   <span class="n">doc</span> <span class="o">=</span> <span class="n">PDDocument</span><span
class="o">.</span><span class="na">load</span><span class="o">(</span>
<span class="s">&quot;my.pdf&quot;</span> <span class="o">);</span>
+<span class="o">}</span>
+<span class="k">finally</span>
+<span class="o">{</span>
+   <span class="k">if</span><span class="o">(</span> <span class="n">doc</span>
<span class="o">!=</span> <span class="kc">null</span> <span class="o">)</span>
+   <span class="o">{</span>
+      <span class="n">doc</span><span class="o">.</span><span
class="na">close</span><span class="o">();</span>
+   <span class="o">}</span>
+<span class="o">}</span>
+</pre></div>
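+<p>The leak in the snippet above, and one way to avoid it, can be reproduced without PDFBox. In this sketch <code>TrackedDoc</code> is a hypothetical stand-in for PDDocument that merely records which instances are still open; creating a second <code>TrackedDoc</code> inside the try block plays the role of <code>PDDocument.load(...)</code>.</p>

```java
import java.util.ArrayList;
import java.util.List;

class CloseEveryDocument {

    /** Hypothetical stand-in for PDDocument; tracks instances that were never closed. */
    static final class TrackedDoc {
        static final List<TrackedDoc> OPEN = new ArrayList<>();

        TrackedDoc() { OPEN.add(this); }

        void close() { OPEN.remove(this); }
    }

    /** Mirrors the FAQ snippet: two objects are created, only one is closed. */
    static int leakyPattern() {
        TrackedDoc.OPEN.clear();
        TrackedDoc doc = new TrackedDoc();      // first object
        try {
            doc = new TrackedDoc();             // second object; the first is now unreachable
        } finally {
            if (doc != null) {
                doc.close();                    // closes the second object only
            }
        }
        return TrackedDoc.OPEN.size();
    }

    /** Starts from null so that exactly one object is created and closed. */
    static int fixedPattern() {
        TrackedDoc.OPEN.clear();
        TrackedDoc doc = null;
        try {
            doc = new TrackedDoc();
        } finally {
            if (doc != null) {
                doc.close();
            }
        }
        return TrackedDoc.OPEN.size();
    }

    public static void main(String[] args) {
        System.out.println("still open (leaky): " + leakyPattern());
        System.out.println("still open (fixed): " + fixedPattern());
    }
}
```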
+
+
+<h2 id="text-extraction_1">Text Extraction</h2>
+<h3 id="notext">How come I am not getting any text from the PDF document?</h3>
+<p>Text extraction from a PDF document is a complicated task and there are many factors
+involved that affect the possibility and accuracy of text extraction.  It would be helpful
+to the PDFBox team if you could try a couple of things.</p>
+<ul>
+<li>Open the PDF in Acrobat and try to extract text from there.  If Acrobat can extract
text then PDFBox 
+should be able to as well and it is a bug if it cannot.  If Acrobat cannot extract text then
PDFBox 'probably' cannot either.</li>
+<li>It might really be an image instead of text.  Some PDF documents are just images that have been scanned in.
+You can tell by using the selection tool in Acrobat: if you can't select any text, it is probably an image.</li>
+</ul>
+<h3 id="gibberish">How come I am getting gibberish (G38G43G36G51G5) when extracting text?</h3>
+<p>This is because the characters in a PDF document can use a custom encoding
+instead of Unicode or ASCII.  Gibberish text
+probably means that a meaningless internal encoding is being used.  The
+only way to access the text is to use OCR.  This may be a future
+enhancement.</p>
+<h3 id="fontwidth">What does "java.io.IOException: Can't handle font width" mean?</h3>
+<p>This probably means that the "Resources" directory is not in your classpath. The
+Resources directory is included in the PDFBox jar so this is only a problem if you
+are building PDFBox yourself and not using the binary.</p>
+<h3 id="permission">Why do I get "You do not have permission to extract text" on some
documents?</h3>
+<p>PDF documents have certain security permissions that can be applied to them and
two 
+passwords associated with them, a user password and a master password. If the "cannot extract
text"
+permission bit is set then you need to decrypt the document with the master password in order
+to extract the text.</p>
+<h3 id="partially">Can't we just extract the text without parsing the whole document or extract text as it is parsed?</h3>
+<p>Not really, for a few reasons.</p>
+<ul>
+<li>If the document is encrypted then you need to parse at least until the encryption
dictionary before 
+you can decrypt.</li>
+<li>Sometimes the PDFont contains vital information needed for text extraction.</li>
+<li>Text on a page does not have to be drawn in reading order. For example, if the page said "Hello World",
+the PDF could have been written such that "World" gets drawn first, then the cursor moves to the left and
+the word "Hello" is drawn.</li>
+</ul> 
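+<p>The "Hello World" case from the last point can be sketched like this. The <code>Fragment</code> class and <code>inReadingOrder</code> method are hypothetical and do not reflect how PDFBox is implemented; the sketch only shows that reading order cannot be known until every fragment on the page has been collected. It assumes PDF-style coordinates in which y grows upward, and it compares y values exactly, which real documents would not allow.</p>

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ReadingOrder {

    /** Hypothetical text fragment: a word plus the position at which it was drawn. */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Reorders fragments top to bottom, then left to right, and joins them with spaces. */
    static String inReadingOrder(List<Fragment> drawn) {
        List<Fragment> sorted = new ArrayList<>(drawn);
        // Larger y is higher on the page, so sort by descending y, then ascending x.
        sorted.sort(Comparator.comparingDouble((Fragment f) -> -f.y)
                .thenComparingDouble(f -> f.x));
        StringBuilder out = new StringBuilder();
        for (Fragment f : sorted) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(f.text);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Fragment> drawn = new ArrayList<>();
        drawn.add(new Fragment("World", 150, 700)); // drawn first, but further right
        drawn.add(new Fragment("Hello", 100, 700)); // drawn second, but further left
        System.out.println(inReadingOrder(drawn));  // prints "Hello World"
    }
}
```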
+            </div>
+        </div>
+    </div>
+
+      <footer id="copyright">
+          <div class="row-fluid">
+              <div class="span3">
+                  <!-- nothing in here on purpose -->
+              </div>
+              
+              <div class="span9">
+                  <p>Copyright © 2013 <a href="http://www.apache.org/">The
Apache Software Foundation</a>, Licensed under the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache
License, Version 2.0</a>. <br/>
+                  Apache PDFBox, PDFBox, Apache, the Apache feather logo and the Apache PDFBox
project logos are trademarks of The Apache Software Foundation.</p>
+              </div>
+          </div>
+      </footer>
+      
+  </body>
+</html>


