pdfbox-users mailing list archives

From Diogo Ribeiro <diogo_ribe...@hotmail.com>
Subject Extract text from PDF, wrong sort order
Date Sat, 16 Jan 2016 11:52:00 GMT
Hi guys,

I'm using PDFBox 1.8.10 to extract some text from a PDF (see attachment).

The extracted text comes out in the wrong order, even with setSortByPosition(true): in the first line, the name fragments are scrambled.

Got:

1/435 S LOPES CÂNDIDO FELIX LOPESABEL DIA 27-09-1964
FRANCISCA MARIA DIAS

Was expecting:

1/435 ABEL DIAS LOPES CÂNDIDO FELIX LOPES 27-09-1964
FRANCISCA MARIA DIAS

My simple code:

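        import java.io.File;
        import org.apache.pdfbox.pdmodel.PDDocument;
        import org.apache.pdfbox.util.PDFTextStripper;

        PDDocument pdf = PDDocument.load(new File(FILE_PATH));
        try {
            PDFTextStripper stripper = new PDFTextStripper();

            // Extract the first page only
            stripper.setStartPage(1);
            stripper.setEndPage(1);
            // Order text by its position on the page, not by content-stream order
            stripper.setSortByPosition(true);

            String plainText = stripper.getText(pdf);
            System.out.println(plainText);
        } finally {
            // Release the document's resources
            pdf.close();
        }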


Thanks in advance.