pdfbox-users mailing list archives

From Eliot Kimber <ekim...@rsicms.com>
Subject Re: question
Date Fri, 31 May 2013 17:12:40 GMT
The last time I had to extract right-to-left text from a PDF, the main issue
was that the text is in the data stream in the order it's placed on the
page, not in reading order. That means the characters of a right-to-left
word would be "tac", not "cat" as they would be in XML, for example.
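
As a rough illustration of that difference, here is a minimal sketch in
Java. PlacedGlyph and OrderDemo are made-up names for this example, not
PDFBox types, and the x positions are invented. It joins the same three
glyphs once in left-to-right page order and once in the opposite order,
using "cat" as a stand-in for a right-to-left word as above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of a character placed on the page; not a PDFBox type.
final class PlacedGlyph {
    final char ch;
    final float x; // horizontal position on the page (invented values in the usage)

    PlacedGlyph(char ch, float x) {
        this.ch = ch;
        this.x = x;
    }
}

// Hypothetical helper that joins glyphs in one of two horizontal orders.
final class OrderDemo {
    // ascending = true  -> left-to-right page order
    // ascending = false -> right-to-left order (reading order for an RTL word)
    static String joinByX(List<PlacedGlyph> glyphs, boolean ascending) {
        List<PlacedGlyph> sorted = new ArrayList<>(glyphs);
        Comparator<PlacedGlyph> byX = Comparator.comparingDouble(g -> g.x);
        sorted.sort(ascending ? byX : byX.reversed());
        StringBuilder sb = new StringBuilder();
        for (PlacedGlyph g : sorted) {
            sb.append(g.ch);
        }
        return sb.toString();
    }
}
```

With 'c' at x=30, 'a' at x=20 and 't' at x=10 (the first letter of the
right-to-left word is the rightmost one), joinByX(glyphs, true) gives the
page order and joinByX(glyphs, false) gives the reading order.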

If Arabic numbers are rendered right-to-left then what you're seeing in the
PDF reflects that.

That is, the data stream reflects the order the characters are placed on the
page, not necessarily their source order (the order they would occur in XML
or in a word-processing document).

So you may have no choice but to assume all numbers are right-to-left, or
else look for other clues that indicate the reading order. Reading order
can of course change within a run of text, for example when English words
are rendered left-to-right inside right-to-left text.
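
If you do go with the "assume all numbers are right-to-left" approach, one
way to apply it is as a post-processing step on the extracted string. The
following is only a sketch under that assumption: DigitRunFixer and
reverseDigitRuns are hypothetical names, not part of PDFBox, and the code
only handles plain runs of digits. It does nothing sensible for decimal
separators, dates, or numbers that were already in the right order.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-processing helper; not part of PDFBox.
final class DigitRunFixer {
    // Two or more consecutive decimal digits in any script
    // (ASCII 0-9, Arabic-Indic, Extended Arabic-Indic, ...).
    private static final Pattern DIGIT_RUN = Pattern.compile("\\p{Nd}{2,}");

    // Reverses every digit run in the extracted text and leaves all
    // other characters where they are.
    // ASSUMPTION: every digit run came out reversed, which will not be
    // true for every PDF.
    static String reverseDigitRuns(String extracted) {
        Matcher m = DIGIT_RUN.matcher(extracted);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(extracted, last, m.start());             // text before the run
            out.append(new StringBuilder(m.group()).reverse()); // the run, reversed
            last = m.end();
        }
        out.append(extracted, last, extracted.length());        // remaining text
        return out.toString();
    }
}
```

You would call it on the string returned by PDFTextStripper.getText(document),
for example reverseDigitRuns("4891"). Check the result against a few known
values from your own documents before relying on it.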

The work I did was converting Arabic ledgers to HTML so I didn't have to try
to correctly reflect the reading order because I was just creating a visual
representation, but I know it came as a bit of a surprise that the order of
characters in the PDF reflected the order as presented, not the reading
order, at least in the samples I had. I guess it would be possible to
construct PDFs where the characters can occur in the PDF data in reading
order and the drawing commands produce the correct order as presented.



On 5/31/13 10:36 AM, "soleymani mohsen" <membrown@gmail.com> wrote:

> hello
> I'm using your API and it works very well, but I have a question.
> I use pdfbox (with icu4j-51, and I also call the setSortByPosition(true)
> method) for text extraction from PDFs in right-to-left languages (Hebrew /
> Persian / Arabic).
> Everything is OK except that numbers come out right-to-left. For example,
> 1984 is parsed as 4891, and
> 12345 becomes 54321.
> Please help: what should I do?
> thank you.

Eliot Kimber
Senior Solutions Architect, RSI Content Solutions
"Bringing Strategy, Content, and Technology Together"
Main: 512.554.9368
Book: DITA For Practitioners, from XML Press,
