pdfbox-users mailing list archives

From David Patterson <patterd20...@gmail.com>
Subject Problem extracting and processing text from a PDF
Date Wed, 05 Apr 2017 19:46:43 GMT
Hello,



I’m trying to extract the text from a PDF that was saved from a Word
document.



I am using Release 2.0.5 of pdfbox and pdfbox-tools, with Java 8 on a
Windows machine.



I’m using the following code to get the text:



PDDocument pdDocument = PDDocument.load( pdfFile );
PDFTextStripper stripper = new PDFTextStripper();
String rawText = stripper.getText( pdDocument );
// end of code excerpt



I’m running the same code on a collection of files. Most work as expected.
I can see the following in the text of the Table of Contents:

5.15.1 ADDENDA...................................................................................... 1
5.15.2 YOU ARE HERE ............................................................................ 2
5.15.3 INTRODUCTION ............................................................................ 4



However, for two files, what I see is:

5.16 xxx SYSTEM PROCEDURES ............................................................ 1
 ADDENDA............................................................................................... 1 5.16.1
YOU ARE HERE ...................................................................................... 2 5.16.2
INTRODUCTION ....................................................................................... 4 5.16.3



Note: the outline numbers (5.16.1, etc.) are at the end of the line, not at
the beginning.
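
To make the symptom easier to check across the whole collection, a minimal sketch of a check on the extracted string might look like the following. It only inspects the text returned by getText() and does not call PDFBox. The class name TocOrderCheck, the method name misplaced, and both regular expressions are illustrative assumptions, not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: flags extracted lines whose outline number
// (for example 5.16.1) sits at the end of the line instead of the start.
final class TocOrderCheck {

    // Assumed shape of an outline number: digits with at least one ".digits" group.
    private static final Pattern TRAILING =
            Pattern.compile(".*\\s\\d+(\\.\\d+)+\\s*$");
    private static final Pattern LEADING =
            Pattern.compile("^\\s*\\d+(\\.\\d+)+\\s.*");

    // Returns every line that ends with an outline number but does not start with one.
    static List<String> misplaced(String rawText) {
        List<String> hits = new ArrayList<>();
        for (String line : rawText.split("\\R")) {
            if (TRAILING.matcher(line).matches() && !LEADING.matcher(line).matches()) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Shortened versions of the two kinds of lines shown above.
        String sample = "5.15.1 ADDENDA ..... 1\n"
                      + " ADDENDA ..... 1 5.16.1\n";
        for (String line : misplaced(sample)) {
            System.out.println("misplaced outline number: " + line);
        }
    }
}
```

Running this over the raw text of each file would show which documents are affected without the PDFs themselves having to leave the system.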



A)  Is this a known, solvable problem?

B)  If not, is there a different way I can try to extract the data?

C)  If not, can I help debug/diagnose the problem? I cannot send the
offending PDF file out of my system.

Thanks



Dave Patterson
