From: Andreas Girgensohn
To: pdfbox-users@incubator.apache.org
Subject: Re: Font size and text height in PDFBox 0.8.0
Date: Fri, 20 Feb 2009 11:54:26 -0800
Message-ID: <40a75b980902201154t6e82527et57c41f1c5a3b55f0@mail.gmail.com>
In-Reply-To: <499E7FDF.8040702@lehmi.de>

Hi Andreas

Thanks for the explanation on the font size. This still leaves the
issue with the zero height. This mostly happens when the font size is 1,
as in my previous example. I found one example with a sensible font
size where the height is still zero. In this case, the width is very
small, so this could be a scaling problem.

http://place.fxpal.com/zero-height.pdf

Here is an example for the other font-related problem, which causes an
exception in COSDictionary.getNameAsString because
COSString{HeadingPaginationFont} is stored in the dictionary.

http://place.fxpal.com/pagination-font.pdf

Please let me know if you would like to have more samples.

Andreas

On Fri, Feb 20, 2009 at 2:03 AM, info@lehmi.de wrote:
> Hi Andreas
>
>> I'm using PDFBox to extract text, bounding boxes, and font information
>> from PDF files from a variety of sources. Mostly in files with Type 3
>> fonts, but also in others, org.apache.pdfbox.util.TextPosition does
>> not return the correct information.
>> In those cases, getHeight returns 0
>> and getFontSize returns 1 (the latter happens much more frequently).
>> PDFBox 0.8.0 (from the svn trunk) addresses the issue for about one
>> third of the documents that had problems in PDFBox 0.7.3. Here is an
>> example of a document that is especially bad. PDFont also does not
>> have any base font information, maybe because of the Type 3 fonts.
> The problem is the way some PDF generators produce their documents.
> There is the PDF operator Tf to set the font size directly, and that is
> the result you see using TextPosition.getFontSize(). But in many cases
> the font size is set to the default size 1 and is scaled to the real
> size through the text matrix. PDFBox reads and uses both to draw the
> string with the right scaling. So the expected result is the same every
> time, whether the PDF document uses Tf = 12 and Tm = 1 or the other way
> round, Tf = 1 and Tm = 12.
> I'll extend TextPosition to get the size as a combination of the
> font size and the scaling.
>
>
>> P.S.: For a few documents, I ran into a different font-related issue
>> (see stack trace below). I added a print statement to determine the
>> values that cause the problem.
> Can you provide us an example for this issue?
>
>
> Andreas Lehmkühler
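
To make the Tf/Tm point in the quoted reply concrete, here is a minimal sketch of how a font size and a text-matrix scale might be combined into one effective size. It is plain Java with no PDFBox dependency. The class and method names are hypothetical and are not part of the PDFBox API. It also assumes that only the text matrix matters, so it ignores the current transformation matrix and any other scaling a real renderer applies.

```java
// Sketch only: plain Java, no PDFBox dependency. All names are hypothetical.
final class EffectiveFontSize {

    /**
     * Vertical scale factor of a PDF text matrix given as [a b c d e f].
     * The unit vector (0, 1) is mapped to (c, d), so its length
     * sqrt(c^2 + d^2) is how much glyph heights are stretched.
     */
    static double verticalScale(double[] m) {
        if (m.length != 6) {
            throw new IllegalArgumentException("expected 6 matrix entries");
        }
        return Math.hypot(m[2], m[3]);
    }

    /** Combines the size operand of Tf with the text-matrix scaling. */
    static double effectiveSize(double tfSize, double[] textMatrix) {
        return tfSize * verticalScale(textMatrix);
    }

    public static void main(String[] args) {
        double[] identity = {1, 0, 0, 1, 0, 0};      // Tm leaves the size alone
        double[] scaledBy12 = {12, 0, 0, 12, 0, 0};  // Tm carries the real size

        // Tf = 12 with an identity matrix, and Tf = 1 with a 12x matrix
        System.out.println(effectiveSize(12, identity));
        System.out.println(effectiveSize(1, scaledBy12));
    }
}
```

Under these assumptions both calls should describe the same rendered size of 12, which is why a raw font size of 1 on its own says little about how large the text actually appears.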