From: Eliot Kimber
To: "users@pdfbox.apache.org"
Date: Fri, 31 May 2013 12:12:40 -0500
Subject: Re: question

The last time I had to extract right-to-left text from PDF, the main
issue was that the text is in the data stream in the order it's placed
on the page, not the reading order. That means the characters for a
right-to-left word would be "tac", not "cat" as they would be in XML,
for example. If Arabic numbers are rendered right-to-left, then what
you're seeing in the PDF reflects that.

That is, the data stream reflects the order in which the characters are
placed on the page, not necessarily their source order (the order they
would occur in XML or in a word-processing document).

So you may have no choice but to assume all numbers are right-to-left,
or to look for other clues that indicate the reading order. Of course,
the reading order can change within a run of text, for example where
English words are rendered left-to-right within right-to-left text.

The work I did was converting Arabic ledgers to HTML, so I didn't have
to reflect the reading order correctly: I was just creating a visual
representation. Still, it came as a bit of a surprise that the order of
characters in the PDF reflected the order as presented, not the reading
order, at least in the samples I had.

I guess it would be possible to construct PDFs where the characters
occur in the PDF data in reading order and the drawing commands still
produce the correct order as presented.

Cheers,

E.

On 5/31/13 10:36 AM, "soleymani mohsen" wrote:

> hello
> I'am usnig your API, it's very well but i have a question ?
> i use pdfbox( and use icu4j-51 and also call setSortByPosition(true)
> method ) for text extraction from right to left languages ( hebrew /
> persian / arabic ) pdf
>
> all things are ok but numbers get right to left for example : 1984 is
> parsed 4891 or
> 12345 go into 54321
>
> please help me what should i do?
> thank you.

--
Eliot Kimber
Senior Solutions Architect, RSI Content Solutions
"Bringing Strategy, Content, and Technology Together"
Main: 512.554.9368
www.rsicms.com
www.rsuitecms.com
Book: DITA For Practitioners, from XML Press,
http://xmlpress.net/publications/dita/practitioners-1/
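
A minimal sketch of the "assume all numbers are right-to-left" workaround
mentioned in the reply above, applied as a post-processing step to text
that has already been extracted. The class and method names
(DigitRunFixer, reverseDigitRuns) are hypothetical and not part of
PDFBox or ICU4J. It rests on the assumption that every digit run in the
input came out reversed, which will not hold for every document.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: it post-processes a String that
// was already extracted from the PDF by whatever means.
final class DigitRunFixer {

    // A run of two or more decimal digits. \p{Nd} also matches Arabic-Indic
    // and Persian digits. Numbers containing separators such as '.' or ','
    // are NOT handled correctly by this sketch.
    private static final Pattern DIGIT_RUN = Pattern.compile("\\p{Nd}{2,}");

    // Assumption: EVERY digit run in the input is in reversed order.
    static String reverseDigitRuns(String extracted) {
        Matcher m = DIGIT_RUN.matcher(extracted);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text between the previous digit run and this one.
            out.append(extracted, last, m.start());
            // Append the digit run with its characters reversed.
            out.append(new StringBuilder(m.group()).reverse());
            last = m.end();
        }
        // Copy whatever follows the final digit run.
        out.append(extracted, last, extracted.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: born in 1984, id 12345
        System.out.println(reverseDigitRuns("born in 4891, id 54321"));
    }
}
```

Because the fix is blind, it will corrupt any number that was extracted
in the correct order, so it is only worth trying on documents where the
reversal is consistent.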