pdfbox-users mailing list archives

From "Amir H. Jadidinejad" <amir.jad...@yahoo.com.INVALID>
Subject Re: Problem with mixed RTL/LTR pdfs
Date Sat, 02 Aug 2014 22:27:59 GMT
After reading "PDFTextStripper.java", I think it's a bug.
In its "writePage" method, this class defines a variable "isRtlDominant" as follows:
boolean isRtlDominant = rtlCount > ltrCount;
The class counts the RTL and LTR characters and uses that single comparison to decide whether
the whole content should be reversed. That is wrong for mixed text: the decision must be made
for each word, not for the content as a whole.
Any ideas for solving the problem with minimal changes are welcome.
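
One possible direction, as a rough sketch only: apply the same "more RTL than LTR" test to each whitespace-delimited word and reverse only the words that pass it. The class below is hypothetical and standalone; "RtlRunSketch", "isRtlDominant(String)" and "reverseRtlWords" are names made up for illustration and are not part of PDFBox. It assumes the extracted text arrives in visual order, so RTL words come out reversed, and it only fixes the character order inside each word. It does not reorder the words on an RTL line and does not handle mirrored punctuation.

```java
final class RtlRunSketch {

    // True for characters whose Unicode bidi class is strongly right-to-left.
    public static boolean isRtl(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    // Same comparison as the quoted line, applied to whatever text is passed in.
    public static boolean isRtlDominant(String text) {
        int rtlCount = 0;
        int ltrCount = 0;
        for (char c : text.toCharArray()) {
            if (isRtl(c)) {
                rtlCount++;
            } else if (Character.getDirectionality(c)
                    == Character.DIRECTIONALITY_LEFT_TO_RIGHT) {
                ltrCount++;
            }
        }
        return rtlCount > ltrCount;
    }

    // Splits the line into alternating whitespace and non-whitespace tokens
    // and reverses only the non-whitespace tokens that are RTL-dominant.
    public static String reverseRtlWords(String line) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < line.length()) {
            boolean ws = Character.isWhitespace(line.charAt(i));
            int j = i;
            while (j < line.length()
                    && Character.isWhitespace(line.charAt(j)) == ws) {
                j++;
            }
            String token = line.substring(i, j);
            if (!ws && isRtlDominant(token)) {
                out.append(new StringBuilder(token).reverse());
            } else {
                out.append(token);
            }
            i = j;
        }
        return out.toString();
    }
}
```

With this sketch, a line made of a reversed Hebrew word followed by "hello" keeps "hello" untouched and flips only the Hebrew word, whereas a single whole-text flag would either leave the Hebrew word reversed or reverse the English word too, depending on which script has more characters.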
Thanks.



________________________________
 From: Amir H. Jadidinejad <amir.jadidi@yahoo.com.INVALID>
To: user pdfbox <users@pdfbox.apache.org> 
Sent: Sunday, August 3, 2014 1:15 AM
Subject: Problem with mixed RTL/LTR pdfs
 


Hi,
I can extract the content of monolingual PDF files using the following code:
        PDFTextStripper stripper = new PDFTextStripper();
        PDDocument doc = PDDocument.load(file);
        stripper.setSortByPosition(true);
        String txt = stripper.getText(doc);
        doc.close();


This works perfectly when the input document is monolingual.

The problem is that when the input document combines right-to-left and left-to-right
languages, the output characters of one language are reversed!

A sample bilingual PDF document is attached.

Would you please help me with this issue?

Thanks.