pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Text extracting error
Date Wed, 28 Nov 2018 10:27:08 GMT
Hi, please try the setSortByPosition() method of the stripper. See also the 
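
To illustrate what sorting by position changes, here is a minimal, self-contained sketch. It does not use PDFBox and is not PDFBox's actual algorithm; the Fragment class, the coordinates, and the assumption that the file stores the bottom row first are all hypothetical, chosen to mirror the 2x2 table from the report quoted below. In PDFBox itself, the corresponding step would be calling setSortByPosition(true) on the PDFTextStripper before getText(document).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PositionSortSketch {

    /** Hypothetical stand-in for a positioned piece of text; not a PDFBox class. */
    static final class Fragment {
        final String text;
        final double x; // distance from the left edge of the page
        final double y; // distance from the top edge (smaller = higher on the page)

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins the fragments in the order they are stored (no sorting). */
    static String inStreamOrder(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            sb.append(f.text);
        }
        return sb.toString();
    }

    /** Joins the fragments top-to-bottom, then left-to-right. */
    static String sortedByPosition(List<Fragment> fragments) {
        List<Fragment> copy = new ArrayList<>(fragments);
        copy.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return inStreamOrder(copy);
    }

    /** Assumed storage order: bottom row of the table first, top row last. */
    static List<Fragment> sampleTable() {
        List<Fragment> cells = new ArrayList<>();
        cells.add(new Fragment("다", 10, 40));
        cells.add(new Fragment("라", 60, 40));
        cells.add(new Fragment("가", 10, 20));
        cells.add(new Fragment("나", 60, 20));
        return cells;
    }

    public static void main(String[] args) {
        List<Fragment> cells = sampleTable();
        System.out.println(inStreamOrder(cells));    // stored order: bottom row first
        System.out.println(sortedByPosition(cells)); // visual reading order
    }
}
```

Without sorting, text comes out in whatever order the producing application wrote it into the file, which need not match the visual layout. Sorting by position trades that stored order for reading order.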

--- Original Message ---
From: 김보섭
Subject: Text extracting error
Date: 28.11.2018, 11:01
To: users@pdfbox.apache.org

We are trying to extract text from PDF files.
When we extract Korean text from a PDF file, the order of the text
is broken, while English text is extracted correctly.
This does not mean that the Korean text is missing; all of it is extracted,
but it comes out in the wrong sequence.
The problem occurs when
1. the PDF file contains a chart, or
2. the characters differ in size from one another.

When we extract text from a PDF that contains a chart, the text in the lowest
row appears at the beginning and the text in the highest row appears at the end.

Example (in the chart):
| 가 | 나 |
| 다 | 라 |
-> 다라

When a PDF uses multiple text sizes and fonts,
the text in the smallest, simplest font is extracted at the beginning and
the text in the largest, less simple font is extracted at the end.

Please check whether this is a bug in extracting Korean text.

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import javafx.stage.FileChooser; // assumed: showOpenDialog(null) matches the JavaFX FileChooser
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public static void extractStringfromPDF() throws IOException {
    // Let the user pick the PDF file
    final FileChooser filechooser = new FileChooser();
    File file = filechooser.showOpenDialog(null);
    try {
        PDDocument document = PDDocument.load(file);
        PDFTextStripper pdfStripper = new PDFTextStripper();
        String text = pdfStripper.getText(document);

        // Append the extracted text to <input path>.txt
        File txtFile = new File(file.getPath() + ".txt");
        FileWriter fw = new FileWriter(txtFile, true);
        fw.write(text); // the posted code opened the writer but never wrote to it
        fw.close();
        System.out.println(text);
        document.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
The code above is what we used in our program.
