pdfbox-users mailing list archives

From Mohit Goyal <Mohit.Go...@pb.com>
Subject PdfParser giving garbage character
Date Fri, 13 May 2016 06:28:50 GMT
Hi,

I have a PDF that contains text in Malayalam (an Indian language). When I parse it with
Apache Tika, the output contains the garbage character '?' instead of the Malayalam text.


I checked the PDF with the pdffonts utility, and it seems that a ToUnicode table is missing for some fonts.
Output of pdffonts:
Config Error: No display font for 'Symbol'
Config Error: No display font for 'ZapfDingbats'
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
YTLJPR+AnjaliOldLipi                 CID TrueType      yes yes yes   1671  0
Times-Roman                          Type 1            no  no  no    1672  0
Times-Bold                           Type 1            no  no  no     127  0


Please find the PDF attached.

Code:

// Files here is Guava's com.google.common.io.Files
BufferedWriter writer = Files.newWriter(new File("file-output.txt"), Charset.forName("UTF-8"));
BodyContentHandler handler = new BodyContentHandler(writer);
ParseContext pcontext = new ParseContext();
Metadata metadata = new Metadata();
PDFParser pdfparser = new PDFParser();
pdfparser.parse(inputstream, handler, metadata, pcontext);
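
To rule out the output encoding as the source of the '?' characters, a small standalone check like the one below may help. It is only a sketch: the class name EncodingCheck and its helper methods are made up for illustration, and the sample string is simply the word "Malayalam" written in Malayalam script. It shows that a UTF-8 round trip preserves the text, while encoding to US-ASCII replaces every character with '?'. It says nothing about the font mappings inside the PDF itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {
    // Encode the text with the given charset, then decode the bytes again.
    // Characters the charset cannot represent are replaced during encoding.
    public static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    // Count literal '?' characters in the text.
    public static long countQuestionMarks(String text) {
        return text.chars().filter(c -> c == '?').count();
    }

    public static void main(String[] args) {
        // Six Malayalam code points, written as escapes so the source
        // file's own encoding does not matter.
        String sample = "\u0D2E\u0D32\u0D2F\u0D3E\u0D33\u0D02";

        String viaUtf8 = roundTrip(sample, StandardCharsets.UTF_8);
        String viaAscii = roundTrip(sample, StandardCharsets.US_ASCII);

        System.out.println("UTF-8 round trip preserved text: " + viaUtf8.equals(sample));
        System.out.println("'?' count after US-ASCII round trip: " + countQuestionMarks(viaAscii));
    }
}
```

If the UTF-8 round trip preserves the sample, the '?' characters in file-output.txt are more likely to come from the extraction step than from the writer.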

Any suggestions?

Thanks
Mohit Goyal
