pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Arabic PDFs - ordering of normalized ligatures
Date Tue, 30 Apr 2019 09:15:10 GMT
Hi,

I've created https://issues.apache.org/jira/browse/PDFBOX-4531 and also 
attached a reduced version of the problem PDF.
Please verify that these are really the two lines.

But don't expect this to be fixed soon - none of us knows Arabic and it 
is extremely difficult to understand what is going on. I had one failed 
attempt to produce a reduced file because it is difficult to recognize 
the glyphs in different fonts (your mail / the PDF / the extraction).

This might also be similar to another (also unsolved) issue related to 
Thai ligatures.

1.8.* may have worked because it used icu4j and 2.0 doesn't.
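
For reference, here is a minimal sketch of the letter order one would expect after normalizing the ligature. It is not PDFBox code and not a fix; it only uses the JDK's java.text.Normalizer, and the class name LigatureOrderSketch is made up for illustration. In logical order, lam (U+0644) should come before alef (U+0627).

```java
import java.text.Normalizer;

// Hypothetical illustration class, not part of PDFBox.
public class LigatureOrderSketch {

    // Apply Unicode compatibility normalization (NFKC), which replaces
    // Arabic presentation-form ligatures with their base letters.
    static String normalize(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FEFB ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
        String decomposed = normalize("\uFEFB");

        // Print the resulting code points in logical order.
        for (int i = 0; i < decomposed.length(); i++) {
            System.out.printf("U+%04X%n", (int) decomposed.charAt(i));
        }
        // prints U+0644 (lam) and then U+0627 (alef)
    }
}
```

The 2.0.15 output quoted below appears to contain the same two letters, but with the alef placed before the lam.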

What we'd really need is people who can not only fix this, but also check 
the extraction of other Arabic test PDFs, and who keep hanging around here 
to decide whether any extraction changes are regressions, improvements 
or irrelevant.

Tilman

Am 30.04.2019 um 04:35 schrieb Elias Peterson:
> Hello,
>
> I think I'm seeing some issues concerning the handling of the Arabic lam-with-alef ligature.
> I'm attempting to process the PDF here:
> https://www.rand.org/content/dam/rand/pubs/perspectives/PE100/PE122/RAND_PE122z1.arabic.pdf
>
> When I run the ExtractText command with 2.0.15 I get the following:
> $ java -jar pdfbox-app-2.0.15.jar ExtractText -encoding UTF-8 RAND_PE122z1.arabic.pdf output.txt
> $ head output.txt
> C O R P O R A T I O N
> منظور تحليلي
> رؤى خبير بشأن قضايا السياسات اآلنية
> االتفاق مع إيران
> األيام التي تلي
> ...
>
> The issue is with the last two lines in the above snippet: my understanding is
> that the ligature لا was normalized, but that the two letters that compose it are in the
> wrong order. I was thinking that PDFBOX-684 sounded similar, and running the same PDF through
> 1.8.16 I see the ligature is normalized in the way I think is expected (although the interspersed
> English-language words are backwards here).
>
> $ java -jar pdfbox-app-1.8.16.jar ExtractText -encoding UTF-8 RAND_PE122z1.arabic.pdf output.txt
> ...
> $ head output.txt
> N O I T A R O P R O C
> منظور تحليلي
> رؤى خبير بشأن قضايا السياسات الآنية
> الاتفاق مع إيران
> الأيام التي تلي
> ...
>
>
> Does this look like a regression or is there possibly something else I should be trying?
> Thank you for any assistance.
>
> --Elias Peterson
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
>

