pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Extract text from PDF, wrong sort order
Date Sat, 16 Jan 2016 13:10:31 GMT
Here's what I get with the 2.0 version:


1/435 CÂNDIDO FELIX LOPESABEL DIAS LOPES 27-09-1964
FRANCISCA MARIA DIAS

This is mostly correct: "CÂNDIDO FELIX LOPES" is on a higher line than 
"ABEL DIAS LOPES". The only problem is the missing line break, which can 
possibly be set with an option.


Assuming you want to extract all this to fill a database, you could also 
try the unsorted output. The only problem then is getting the correct 
count per page.
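
For the count per page, a rough sketch in plain Java (no PDFBox calls) might 
look like the following. It assumes that every record starts a line with an 
"index/total" token such as "1/435", which is only a guess based on the 
sample output above; the class and method names (RecordCounter, 
countRecords) are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Counts records in the text extracted from a single page. */
final class RecordCounter {

    // Assumption: a record begins a line with "<index>/<total>", e.g. "1/435".
    // Dates such as "27-09-1964" use dashes, so they do not match.
    private static final Pattern RECORD_START =
            Pattern.compile("^[ \\t]*\\d+/\\d+\\b", Pattern.MULTILINE);

    /** Returns how many lines of pageText start with an "index/total" token. */
    static int countRecords(String pageText) {
        Matcher matcher = RECORD_START.matcher(pageText);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Sample taken from the 2.0 output above (\u00C2 is the letter Â).
        String page = "1/435 C\u00C2NDIDO FELIX LOPESABEL DIAS LOPES 27-09-1964\n"
                + "FRANCISCA MARIA DIAS\n";
        System.out.println(countRecords(page));
    }
}
```

You would call countRecords once per page, on the text extracted for that 
page alone, and compare the result with the number of rows you expect there.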


Tilman

Am 16.01.2016 um 12:52 schrieb Diogo Ribeiro:
> Hi guys,
>
> I'm using PDFBox 1.8.10 to extract some text from a PDF (see attachment).
>
> The output lines are not correctly sorted.
>
> Got:
>
> 1/435 S LOPES CÂNDIDO FELIX LOPESABEL DIA 27-09-1964
> FRANCISCA MARIA DIAS
>
> Was expecting:
>
> 1/435 ABEL DIAS LOPES CÂNDIDO FELIX LOPES 27-09-1964
> FRANCISCA MARIA DIAS
>
> My simple code:
>
>         PDDocument pdf = PDDocument.load(new File(FILE_PATH));
>
>         PDFTextStripper stripper = new PDFTextStripper();
>
>         stripper.setStartPage(1);
>         stripper.setEndPage(1);
>         stripper.setSortByPosition(true);
>
>         String plainText = stripper.getText(pdf);
>
>         System.out.println(plainText);
>
>
> Thanks in advance.

