pdfbox-users mailing list archives

From Michael Dick <mbd...@tepper.cmu.edu>
Subject Problem with extracting text from pdf using SortByPosition
Date Mon, 17 Jun 2013 21:32:06 GMT

I'm trying to extract text from a PDF (
http://www.oca.state.pa.us/Industry/Electric/elecomp/wpp.pdf), but I'm
having trouble with the way the document is laid out. With the default
setting (SortByPosition false), the last column is not read along with the
rest of the line. Setting SortByPosition to true gets me closer, but it
garbles some of the text (see below).

Is there a way to tweak the settings so the text comes out correctly when
SortByPosition is true? Failing that, is there a way to troubleshoot this
further?

Thanks so much for any advice!


For example, on page 4:
*with SortByPosition true*
*TriEWaegslte  PEennenr gPyower *
*1-87P7r-i9c3e EtoA GCLomE p(9a3r3e -2453)*
*www.trieagletehnrerogyu.cgohm *
*FixedA purigcue:s t 6 3 m1o, n2t0h1 t3erm 7.29 ¢ $36.45 $72.90 $145.80*
*$20 per month *
*for each month *
*remaining in the *
*contract term*

*with SortByPosition false*
*TriEagle Energy*
*1-877-93EAGLE (933-2453)*
*Fixed price:  6 month term 7.29 ¢ $36.45 $72.90 $145.80*
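
One observation that may help with troubleshooting: the garbled first line reads like two separate strings interleaved character by character, "TriEagle" and what appears to be "West" (the remaining lines follow the same pattern). That is what one would expect if two text runs overlap horizontally on the page and the characters are then ordered purely by position. The sketch below is only a toy model of that effect, not PDFBox's actual sorting code; `Glyph`, `layOut`, `readInXOrder` and all coordinates are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class InterleaveDemo {
    // Hypothetical stand-in for a positioned character; not a PDFBox type.
    static final class Glyph {
        final char ch;
        final double x;
        Glyph(char ch, double x) { this.ch = ch; this.x = x; }
    }

    // Place each character of a string at startX, startX + advance, ...
    static List<Glyph> layOut(String text, double startX, double advance) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            glyphs.add(new Glyph(text.charAt(i), startX + i * advance));
        }
        return glyphs;
    }

    // Merge two runs assumed to share a baseline and read them left to right.
    static String readInXOrder(List<Glyph> a, List<Glyph> b) {
        List<Glyph> all = new ArrayList<>(a);
        all.addAll(b);
        all.sort(Comparator.comparingDouble(g -> g.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : all) {
            sb.append(g.ch);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Invented coordinates: the second run starts partway through the first.
        List<Glyph> first = layOut("TriEagle", 0, 2);
        List<Glyph> second = layOut("West", 7, 2);
        System.out.println(readInXOrder(first, second)); // prints TriEWaegslte
    }
}
```

If this model is roughly right, the two runs sit at nearly the same position on the page, so the problem would be in how overlapping runs are ordered rather than in the character decoding.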
