pdfbox-users mailing list archives

From liyg <liyg1...@gmail.com>
Subject chinese invalid charset
Date Fri, 11 Dec 2009 03:54:26 GMT
Hi,
  I am trying to extract the text content of PDF files from my Java code.
I am using PDFBox 0.7.3, and the examples I have found online so far are
rather limited. Basically, I did something like this:

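    // PDFBox 0.7.3 classes live in the org.pdfbox packages.
    import org.pdfbox.pdmodel.PDDocument;
    import org.pdfbox.util.PDFTextStripper;

    PDDocument doc = null;
    try {
        // Load the PDF and extract all of its text as one string.
        doc = PDDocument.load("sample.pdf");
        PDFTextStripper stripper = new PDFTextStripper();
        String text = stripper.getText(doc);
    } finally {
        // Always close the document, even if extraction fails.
        if (doc != null) {
            doc.close();
        }
    }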

Unfortunately, the results are mixed. For most PDF files the extracted text
is good, including the Chinese. For some PDF files, however, the Chinese
characters show up as "□", and for others they show up as "?".
My guess is that the Chinese text is invalid because there are no TTF files.
Is that the reason? Why are some files good and some bad? I really want to
know the reason.
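
To help narrow down which files are affected, here is a minimal sketch (not part of PDFBox) that counts placeholder-like characters in an extracted string. The class name ExtractedTextCheck and the method countSuspicious are hypothetical names chosen for illustration, and the code assumes Java 8 or later. Counting by code point may help tell a literal "?" (U+003F) or "□" (U+25A1) that is really in the string apart from characters that only look wrong when printed or displayed.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ExtractedTextCheck {

    /**
     * Counts Han (Chinese) code points and three placeholder-like
     * code points in the given text: U+25A1 (white square),
     * U+FFFD (replacement character) and U+003F (question mark).
     */
    public static Map<String, Integer> countSuspicious(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("cjk", 0);
        counts.put("whiteSquare", 0);
        counts.put("replacementChar", 0);
        counts.put("questionMark", 0);

        text.codePoints().forEach(cp -> {
            if (cp == 0x25A1) {
                counts.merge("whiteSquare", 1, Integer::sum);
            } else if (cp == 0xFFFD) {
                counts.merge("replacementChar", 1, Integer::sum);
            } else if (cp == '?') {
                counts.merge("questionMark", 1, Integer::sum);
            } else if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN) {
                counts.merge("cjk", 1, Integer::sum);
            }
        });
        return counts;
    }

    public static void main(String[] args) {
        // Illustrative input only: two Han characters, two white squares,
        // one question mark, two Latin letters, one replacement character.
        String sample = "\u4E2D\u6587\u25A1\u25A1?ok\uFFFD";
        System.out.println(countSuspicious(sample));
    }
}
```

Passing the string returned by stripper.getText(doc) to countSuspicious for a good file and a bad file would show whether the placeholders are already present in the extracted text itself, or only appear later when the text is written to a console or file.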
P.S. I'm sorry for my bad English :)
Thanks.
