pdfbox-users mailing list archives

From Maruan Sahyoun <sahy...@fileaffairs.de>
Subject Re: Problem extracting and processing text from a PDF
Date Wed, 05 Apr 2017 20:02:36 GMT
Hi,

> On 05.04.2017 at 21:46, David Patterson <patterd20850@gmail.com> wrote:
> 
> Hello,
> 
> I’m trying to extract the text from a PDF that was saved from a Word
> document.
> 
> I am using Release 2.0.5 of pdfbox and pdfbox-tools, with Java 8 on a
> Windows machine.
> 
> I’m using the following code to get the text:
> 
> PDDocument pdDocument = PDDocument.load( pdfFile );
> PDFTextStripper stripper = new PDFTextStripper();
> String rawText = stripper.getText( pdDocument );
> // end of code excerpt
> 
> I’m running the same code on a collection of files. Most work as expected.
> I can see the following in the text of the Table of Contents:
> 
> 5.15.1 ADDENDA.....................................................
> ................................. 1
> 
> 5.15.2 YOU ARE HERE ..............................
> .............................................. 2
> 
> 5.15.3 INTRODUCTION ..............................
> .............................................. 4
> 
> However, for two files, what I see is:
> 
> 5.16 xxx SYSTEM PROCEDURES
> ............................................................
> 1
> 
> ADDENDA......................................
> ......................................................... 1 5.16.1
> 
> YOU ARE HERE ..............................
> ........................................................
> 2 5.16.2
> 
> INTRODUCTION .......................................................................................
> 4 5.16.3
> 
> Note: the outline numbers (5.16.1, etc.) are at the end of the line, not at
> the beginning.
> 
> A)  Is this a known, solvable problem?
> 
> B)  If not, is there a different way I can try to extract the data?
> 
> C)  If not, can I help debug/diagnose the problem? I cannot send the
> offending PDF file out of my system.


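Try calling setSortByPosition(true) on your PDFTextStripper instance before
you call getText(), i.e. stripper.setSortByPosition( true );
By default the stripper does not sort by position, so the text comes out in
the order in which it was written to the page content stream, and that does
not have to match the visual order on the page.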
https://pdfbox.apache.org/docs/2.0.2/javadocs/org/apache/pdfbox/text/PDFTextStripper.html#setSortByPosition(boolean)
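
To illustrate why sorting by position can help with output like yours, here
is a small stand-alone sketch. It is not PDFBox code and not how PDFBox
implements the option internally; the Fragment class, the coordinates and
the helper names are all made up for the example. It only assumes that the
PDF stores the title and page number of each TOC row before the outline
number, and shows how ordering by position (top to bottom, then left to
right) restores the reading order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Simplified illustration only; this is not PDFBox code. */
public class PositionSortSketch {

    /** Hypothetical stand-in for a run of text placed on a page. */
    static final class Fragment {
        final double x;    // distance from the left edge
        final double y;    // distance from the top edge (grows downward)
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Joins fragments in the given order, starting a new line whenever y changes. */
    static String render(List<Fragment> fragments) {
        StringBuilder out = new StringBuilder();
        Fragment previous = null;
        for (Fragment f : fragments) {
            if (previous != null) {
                out.append(previous.y == f.y ? " " : "\n");
            }
            out.append(f.text);
            previous = f;
        }
        return out.toString();
    }

    /** Returns a copy ordered top to bottom, then left to right. */
    static List<Fragment> sortByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.<Fragment>comparingDouble(f -> f.y)
                .thenComparingDouble(f -> f.x));
        return sorted;
    }

    public static void main(String[] args) {
        // Assumed stream order: title, page number, then the outline number.
        List<Fragment> streamOrder = Arrays.asList(
                new Fragment(90, 100, "ADDENDA"),
                new Fragment(500, 100, "1"),
                new Fragment(40, 100, "5.16.1"),
                new Fragment(90, 120, "YOU ARE HERE"),
                new Fragment(500, 120, "2"),
                new Fragment(40, 120, "5.16.2"));

        // Stream order: outline numbers end up at the end of each line.
        System.out.println(render(streamOrder));
        System.out.println("--");
        // Position order: outline numbers are back at the start of each line.
        System.out.println(render(sortByPosition(streamOrder)));
    }
}
```

In your own code the only change should be the one extra call on the
stripper; the sketch above is just meant to show the idea.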

BR
Maruan


> 
> Thanks
> 
> Dave Patterson

