pdfbox-users mailing list archives

From Maruan Sahyoun <sahy...@fileaffairs.de>
Subject Re: Text extracting error
Date Wed, 28 Nov 2018 10:26:29 GMT
Hi,

> We are trying to extract text from PDF files.
> When we extract Korean text from a PDF file, the order of the text
> is broken, while English text is extracted correctly.
> This does not mean that Korean is not extracted from the PDF; the characters
> come out fine, but their sequence is wrong.
> The problem occurs when
> 1. the PDF file contains a chart
> 2. the characters differ in size from one another
> 
> When we extract text from a PDF that contains a chart, the text in the lowest row
> appears at the beginning and the text in the highest row appears at the end.
> 
> ex) | 가 | 나 | (in the chart)
>     | 다 | 라 |
> -> extracted as:
> 다라
> 가나
> 
> When a PDF has multiple text sizes and fonts,
> the smallest text in the simplest font is extracted at the
> beginning, and
> the largest text in the less simple font is extracted at the end.
> 
> Please check whether this is a bug when extracting Korean.
> 
> public static void extractStringfromPDF() throws IOException{
>       final FileChooser filechooser = new FileChooser();
>       File file = filechooser.showOpenDialog(null);
>       try {
>          PDDocument document = PDDocument.load(file);
>          PDFTextStripper pdfStripper = new PDFTextStripper();
>          String text = pdfStripper.getText(document);
> 
>          File txtFile = new File(file.getPath() + ".txt");
>          FileWriter fw = new FileWriter(txtFile, true);
>          fw.write(text);
>          fw.flush();
>          fw.close();
>          System.out.println(text);
>          document.close();
>       }catch(Exception e) {e.printStackTrace();}
>    }
> The code above is what we use in our program.


Please try the setSortByPosition option

https://pdfbox.apache.org/docs/2.0.12/javadocs/org/apache/pdfbox/text/PDFTextStripper.html#setSortByPosition-boolean-

as this returns the text in "visual" order rather than in the order in which the text objects
appear in the PDF. Depending on the input PDF, this might give you a better result.
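
In your code this would amount to calling pdfStripper.setSortByPosition(true) before
pdfStripper.getText(document). To illustrate why the two orders can differ, here is a
minimal, self-contained sketch. It is a toy model only and not how PDFBox implements
sorting: the Fragment class, the coordinates and the helper names are all made up for
illustration. It assumes the four cells of your example (가, 나, 다, 라) were written to the
PDF bottom row first, which would explain the output you saw.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model only: NOT PDFBox's implementation. With PDFBox itself the change is
// a single call before getText(document): pdfStripper.setSortByPosition(true);
final class SortByPositionSketch {

    /** Hypothetical text fragment with the position where it is drawn (y grows downward). */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins fragments in the given order, starting a new line whenever y changes. */
    static String join(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        Fragment previous = null;
        for (Fragment f : fragments) {
            if (previous != null && f.y != previous.y) {
                sb.append('\n');
            }
            sb.append(f.text);
            previous = f;
        }
        return sb.toString();
    }

    /** Returns a copy sorted top-to-bottom, then left-to-right ("visual" order). */
    static List<Fragment> sortedByPosition(List<Fragment> fragments) {
        List<Fragment> copy = new ArrayList<>(fragments);
        copy.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return copy;
    }

    /** Assumed content-stream order for the 2x2 example: bottom row written first. */
    static List<Fragment> sampleStreamOrder() {
        // Unicode escapes keep this source file ASCII-only.
        List<Fragment> list = new ArrayList<>();
        list.add(new Fragment("\uB2E4", 0, 20));  // "da", bottom-left cell
        list.add(new Fragment("\uB77C", 10, 20)); // "ra", bottom-right cell
        list.add(new Fragment("\uAC00", 0, 10));  // "ga", top-left cell
        list.add(new Fragment("\uB098", 10, 10)); // "na", top-right cell
        return list;
    }

    public static void main(String[] args) {
        List<Fragment> stream = sampleStreamOrder();
        System.out.println("Stream order:");
        System.out.println(join(stream));
        System.out.println("Sorted by position:");
        System.out.println(join(sortedByPosition(stream)));
    }
}
```

In this sketch the stream order gives the bottom row first, as in your report, while the
sorted order gives the top row first. Real PDFs are messier (different font sizes, slightly
different baselines), so sorting by position will not help in every case.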

Maruan

