Subject: Re: Arabic PDFs - ordering of normalized ligatures
To: users@pdfbox.apache.org
From: Tilman Hausherr
Date: Tue, 30 Apr 2019 11:15:10 +0200

Hi,

I've created https://issues.apache.org/jira/browse/PDFBOX-4531 and attached a reduced version of the problem PDF. Please verify that these really are the two lines.

But don't expect this to be fixed soon: none of us knows Arabic, and it is extremely difficult to understand what is going on. I had one failed attempt at producing a reduced file because it is difficult to recognize the glyphs in different fonts (your mail / the PDF / the extraction). This might also be similar to another (also unsolved) issue related to Thai ligatures.

1.8.* may have worked because it used icu4j and 2.0 doesn't.

What we'd really need is people who can not only fix this, but also check the extraction of other Arabic test PDFs and keep hanging around here to decide whether any extraction changes are regressions, improvements or irrelevant.

Tilman

On 30.04.2019 at 04:35, Elias Peterson wrote:
> Hello,
>
> I think I'm seeing some issues concerning the handling of the Arabic lam-with-alef ligature.
> I'm attempting to process the PDF here:
> https://www.rand.org/content/dam/rand/pubs/perspectives/PE100/PE122/RAND_PE122z1.arabic.pdf
>
> When I run the ExtractText command with 2.0.15 I get the following:
> $ java -jar pdfbox-app-2.0.15.jar ExtractText -encoding UTF-8 RAND_PE122z1.arabic.pdf output.txt
> $ head output.txt
> C O R P O R A T I O N
> منظور تحليلي
> رؤى خبير بشأن قضايا السياسات اآلنية
> االتفاق مع إيران
> األيام التي تلي
> ...
>
> The issue is with the last two lines in the above snippet: my understanding is that the ligature لا was normalized, but the two letters that compose it are in the wrong order. I was thinking that PDFBOX-684 sounded similar, and running the same PDF through 1.8.16 I see the ligature is normalized in the way I think is expected (although the interspersed English-language words are backwards here).
>
> $ java -jar pdfbox-app-1.8.16.jar ExtractText -encoding UTF-8 RAND_PE122z1.arabic.pdf output.txt
> ...
> $ head output.txt
> N O I T A R O P R O C
> منظور تحليلي
> رؤى خبير بشأن قضايا السياسات الآنية
> الاتفاق مع إيران
> الأيام التي تلي
> ...
>
> Does this look like a regression or is there possibly something else I should be trying? Thank you for any assistance.
>
> --Elias Peterson
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
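
For anyone who wants to see the symptom in isolation, here is a minimal sketch using only java.text.Normalizer from the JDK. It is not PDFBox code and not a diagnosis of PDFBOX-4531. It assumes, purely for illustration, that the glyphs of a word arrive in visual (left-to-right) order with the lam-alef ligature as a single presentation-form character (here U+FEF5), and it compares two orders of operation. The class and method names are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of PDFBox.
final class LamAlefOrder {

    // Expand presentation-form ligatures (e.g. U+FEF5, U+FEFB) into base letters via NFKC.
    static String expand(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    // Reverse a run of characters (visual order to logical order for a pure RTL word).
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // Reverse first, then expand: lam stays in front of alef.
    static String reverseThenExpand(String visual) {
        return expand(reverse(visual));
    }

    // Expand first, then reverse: the two letters of the ligature end up swapped.
    static String expandThenReverse(String visual) {
        return reverse(expand(visual));
    }

    // Format a string as a list of code points for easy comparison.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Assumed visual order of the last word of the third extracted line:
        // teh marbuta, yeh, noon, lam-alef-madda ligature, alef.
        String visual = "\u0629\u064A\u0646\uFEF5\u0627";
        System.out.println("reverse then expand: " + codePoints(reverseThenExpand(visual)));
        System.out.println("expand then reverse: " + codePoints(expandThenReverse(visual)));
    }
}
```

Under these assumptions, the second variant produces alef, alef-with-madda, lam at the start of the word, which is the same letter order as in the 2.0.15 output quoted above, while the first variant matches the 1.8.16 output. Whether this is what actually happens inside the text stripper would still need to be checked against the reduced PDF.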