pdfbox-users mailing list archives

From David Patterson <patterd20...@gmail.com>
Subject Re: Problem extracting and processing text from a PDF
Date Thu, 06 Apr 2017 19:12:42 GMT
Thank you very much.

I used the PDFTextStripper.setSortByPosition(true) method call, and the
problem is completely cleared up.
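
For anyone finding this thread later: in the excerpt quoted below, the call
goes on the stripper after it is constructed and before getText() is called.
The sketch that follows is only an illustration of why reading order can
differ from visual order; it does not use PDFBox and is not PDFBox's actual
algorithm. The Fragment type, the coordinates, and the render/sortByPosition
helpers are all made up for the example. It assumes the outline numbers were
written to the page after the rest of each line, which would match the output
I was seeing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SortByPositionSketch {

    /** Hypothetical text fragment: a string plus where it starts on the page. */
    static final class Fragment {
        final String text;
        final double x; // distance from the left edge of the page
        final double y; // distance from the top edge of the page

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins fragments in the order given, starting a new line whenever y changes. */
    static String render(List<Fragment> fragments) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < fragments.size(); i++) {
            Fragment current = fragments.get(i);
            if (i > 0) {
                boolean sameLine = fragments.get(i - 1).y == current.y;
                out.append(sameLine ? " " : "\n");
            }
            out.append(current.text);
        }
        return out.toString();
    }

    /** Returns a copy ordered top to bottom, then left to right. */
    static List<Fragment> sortByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return sorted;
    }

    public static void main(String[] args) {
        // Assumed writing order: title, page number, then the outline number,
        // even though the outline number sits at the far left of the line.
        List<Fragment> streamOrder = Arrays.asList(
                new Fragment("ADDENDA", 100, 10),
                new Fragment("1", 500, 10),
                new Fragment("5.16.1", 50, 10),
                new Fragment("YOU ARE HERE", 100, 20),
                new Fragment("2", 500, 20),
                new Fragment("5.16.2", 50, 20));

        System.out.println(render(streamOrder));
        // ADDENDA 1 5.16.1
        // YOU ARE HERE 2 5.16.2

        System.out.println(render(sortByPosition(streamOrder)));
        // 5.16.1 ADDENDA 1
        // 5.16.2 YOU ARE HERE 2
    }
}
```

Without the sort, the text comes out in whatever order it was written to the
page; with it, each line reads left to right as it appears on screen.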

Dave P

On Wed, Apr 5, 2017 at 4:02 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
wrote:

> Hi,
>
> > On 05.04.2017 at 21:46, David Patterson <patterd20850@gmail.com> wrote:
> >
> > Hello,
> >
> > I’m trying to extract the text from a PDF that was saved from a Word
> > document.
> >
> > I am using Release 2.0.5 of pdfbox and pdfbox-tools, with Java 8 on a
> > Windows machine.
> >
> > I’m using the following code to get the text:
> >
> > PDDocument pdDocument = PDDocument.load( pdfFile );
> > PDFTextStripper stripper = new PDFTextStripper();
> > String rawText = stripper.getText( pdDocument );
> > // end of code excerpt
> >
> > I’m running the same code on a collection of files. Most work as expected.
> > I can see the following in the text of the Table of Contents:
> >
> > 5.15.1 ADDENDA.....................................................
> > ................................. 1
> >
> > 5.15.2 YOU ARE HERE ..............................
> > .............................................. 2
> >
> > 5.15.3 INTRODUCTION ..............................
> > .............................................. 4
> >
> > However, for two files, what I see is:
> >
> > 5.16 xxx SYSTEM PROCEDURES
> > ............................................................
> > 1
> >
> > ADDENDA......................................
> > ......................................................... 1 5.16.1
> >
> > YOU ARE HERE ..............................
> > ........................................................
> > 2 5.16.2
> >
> > INTRODUCTION ..............................
> > .........................................................
> > 4 5.16.3
> >
> > Note: the outline numbers (5.16.1, etc.) are at the end of the line, not at
> > the beginning.
> >
> > A)  Is this a known, solvable problem?
> >
> > B)  If not, is there a different way I can try to extract the data?
> >
> > C)  If not, can I help debug/diagnose the problem? I cannot send the
> > offending PDF file out of my system.
>
>
> try PDFTextStripper.setSortByPosition(true);
> https://pdfbox.apache.org/docs/2.0.2/javadocs/org/apache/pdfbox/text/PDFTextStripper.html#setSortByPosition(boolean)
>
> BR
> Maruan
>
>
> >
> > Thanks
> >
> > Dave Patterson
>
>
