Message-ID: <4F3FC536.6050507@lehmi.de>
Date: Sat, 18 Feb 2012 16:35:18 +0100
From: Andreas Lehmkuehler
To: users@pdfbox.apache.org
Subject: Re: Help needed to resolve issue with converting Arabic characters to presentation forms

Hi

On 18.02.2012 14:40, Hamed Iravanchi wrote:
> Hi again,
>
> Regarding the CID-coded glyph/character mapping, I have some
> more findings that I want to share; maybe one of you guys can point
> out something that can help me get there faster.
>
> Using Adobe Acrobat, I was able to dig deep into the PDF file structure
> and see how the data is being read by PDFBox.
>
> There are two utilities in the "Options" menu of the Adobe Acrobat "Preflight" tool:
> * "Browse Internal PDF Structure"

PDFBox also provides a tool (PDFDebugger) to browse the internal structure of a PDF.

> * "Browse Internal Structure of All Document Fonts"
>
> In the first one, I could find the "ToUnicode" mapping that I talked
> about before in the font resources. The font is a Type 0 font, which
> has a "CIDFontType2" descendant font. The "awtFont" used to draw
> characters on the graphics object is read from the "FontFile2" stream
> inside this object in the PDF.
>
> There are no CID mappings in this font. CIDToGIDMap is "Identity". I'll
> include a screenshot of this in the email.
>
> On the other hand, the second option ("Browse Internal Structure of
> All Document Fonts") contains glyph details, and ALSO correct CID
> mappings. It's in the following path:
> Font > Internal Structure > Data Tables > Character to Glyph Mapping ('cmap')
>
> For each character, the data contains both the correct Unicode value
> (either original or presentation form) and the correct glyph code.
>
> In PDFBox, if I map the CID to the correct Unicode value from this
> table, it should work fine. But I could not find anywhere in the
> PDFBox code where such mappings are read from the PDF file, and I have
> no idea where in the PDF file such information is stored.
>
> If anyone has an idea, please let me know.

I guess I've cracked the nut. :-)

- PDFBox renders from strings, the same strings that are used for text extraction
- in the case of CID-encoded fonts, the ToUnicode mapping is used to get readable strings, but these strings can't be used to draw the text
- in the case of CID-encoded fonts, we have to use the font-internal ID to address the glyphs

I have to clean up the code and run some tests before checking it in.

> Thanks a lot,
> Hamed

We have to thank you; your detailed analysis helped me find out which piece of code is still missing.

> -- Original Message:
>
> Hi,
>
> On 16.02.2012 05:40, Hesham G. wrote:
>> Hamed,
>>
>> Nice effort. Thanks for sharing the nice information. I hope you
>> will be able to overcome this, and share your solution.
>
> I have to agree, thanks for the details. I also dug deeper into that
> part of the code more than once. The issue is the CID-coded
> glyph/character mapping. Maybe I'm able to crack that nut with your
> information.
>
>> Best regards, Hesham
>>
>> ---------------------------------------------
>> Included message:

BR
Andreas Lehmkühler
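
P.S. To illustrate the three points above, here is a minimal, self-contained sketch. It is not the PDFBox code and uses no PDFBox classes; the class name, the method names, and the sample ToUnicode entries are all made up for illustration. It assumes two-byte character codes that are used directly as CIDs, and an "Identity" CIDToGIDMap as in Hamed's file. The point it shows: two different glyphs can share one Unicode value, so the extracted string cannot be used to find the glyphs again, while the CID can.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: every name and value here is hypothetical, not PDFBox API.
class CidGlyphSketch {
    // Made-up ToUnicode data (CID -> Unicode text). Two different CIDs,
    // standing for two differently shaped glyphs, map to the same letter.
    static final Map<Integer, String> TO_UNICODE = new LinkedHashMap<>();
    static {
        TO_UNICODE.put(3, "\u0628"); // one shaped glyph of ARABIC LETTER BEH
        TO_UNICODE.put(4, "\u0628"); // another shaped glyph of the same letter
        TO_UNICODE.put(7, "\u0627"); // ARABIC LETTER ALEF
    }

    // Assumption: each character code is two bytes and is used as the CID.
    static int[] decodeCids(byte[] bytes) {
        int[] cids = new int[bytes.length / 2];
        for (int i = 0; i < cids.length; i++) {
            cids[i] = ((bytes[2 * i] & 0xFF) << 8) | (bytes[2 * i + 1] & 0xFF);
        }
        return cids;
    }

    // Text extraction path: CID -> Unicode via the ToUnicode mapping.
    static String extractText(int[] cids) {
        StringBuilder sb = new StringBuilder();
        for (int cid : cids) {
            sb.append(TO_UNICODE.getOrDefault(cid, "\uFFFD"));
        }
        return sb.toString();
    }

    // Rendering path: with CIDToGIDMap "Identity", the glyph ID equals the CID.
    static int[] glyphIds(int[] cids) {
        return cids.clone();
    }

    public static void main(String[] args) {
        byte[] shown = {0, 3, 0, 7, 0, 4};
        int[] cids = decodeCids(shown);
        // CIDs 3 and 4 both extract to the same letter, so the string
        // alone no longer tells us which glyph to draw.
        System.out.println("text   : " + extractText(cids));
        System.out.println("glyphs : " + Arrays.toString(glyphIds(cids)));
    }
}
```

The extraction path and the rendering path start from the same CIDs but end in different places, which is why one string cannot serve both purposes for CID-encoded fonts.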