pdfbox-users mailing list archives

From 김보섭 <bosub...@gmail.com>
Subject Text extracting error
Date Wed, 28 Nov 2018 10:01:34 GMT
We are trying to extract text from PDF files.
When we extract Korean text from a PDF file, the order of the characters is
broken, while English text is extracted correctly.
The Korean text itself is not missing; every character is extracted,
but the sequence is wrong.
The problem occurs when:
1. the PDF file contains a chart
2. the characters differ in size from one another

When we extract text from a PDF that contains a chart, the text in the lowest
row appears at the beginning and the text in the highest row appears at the end.

ex) | 가 | 나 |   (in the chart)
    | 다 | 라 |
-> 다라
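
To make the order we expect concrete, here is a minimal sketch in plain Java. It does not use PDFBox at all; the Fragment class, its coordinates, and the inReadingOrder method are hypothetical names made up for this illustration, and the positions are invented values, not taken from our PDF. It only shows the reading order we would expect for the chart above: top row first, then left to right.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ReadingOrderSketch {

    // Hypothetical stand-in for a piece of text and its position on the page.
    // Assumption: x grows to the right, y grows downward (top of page is y = 0).
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Joins the fragments top to bottom, then left to right within a row.
    static String inReadingOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator
                .comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        StringBuilder sb = new StringBuilder();
        for (Fragment f : sorted) {
            sb.append(f.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The list starts with the bottom row, like the output we observed.
        List<Fragment> cells = new ArrayList<>();
        cells.add(new Fragment("다", 0, 20));
        cells.add(new Fragment("라", 10, 20));
        cells.add(new Fragment("가", 0, 0));
        cells.add(new Fragment("나", 10, 0));

        System.out.println(inReadingOrder(cells)); // prints 가나다라
    }
}
```

In other words, for the chart above we expect the top row (가, 나) to come before the bottom row (다, 라).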

When a PDF contains text in multiple sizes and fonts,
the smallest text in the simplest font is extracted at the beginning, and
the largest text in the less simple font is extracted at the end.

Please check whether this is a bug in extracting Korean text.

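import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import javafx.stage.FileChooser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper; // PDFBox 2.x package

public static void extractStringfromPDF() throws IOException {
      final FileChooser filechooser = new FileChooser();
      File file = filechooser.showOpenDialog(null);
      // try-with-resources closes both the document and the writer
      try (PDDocument document = PDDocument.load(file);
           FileWriter fw = new FileWriter(new File(file.getPath() + ".txt"), true)) {
         PDFTextStripper pdfStripper = new PDFTextStripper();
         String text = pdfStripper.getText(document);
         fw.write(text); // append the extracted text to <name>.txt
      } catch (Exception e) {
         e.printStackTrace();
      }
}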
The code above is what we use in our program.
