From: Peter Murray-Rust
To: users@pdfbox.apache.org
Date: Fri, 7 Mar 2014 12:36:45 +0800
Subject: Re: 2 questions

Agreed. In our reconstruction of the scientific content of technical
documents from PDF (AMI2, http://www.bitbucket.org/petermr/ami2 ) we throw
away all character groupings from the PDF and render each character to SVG
with its coordinates and other attributes (stroke, font-size, etc.). This is
because there is no consistency in how PDF tools create character groupings:
they are often split at kerning points rather than at whitespace, and they
may be boustrophedonic
(http://dictionary.reference.com/browse/boustrophedonic ). The only reliable
strategy is to extract the coordinates, font size and (hopefully) the width
of each character. This allows phrases to be generated. Creating sentences
and paragraphs, lists and tables is hard and discipline-dependent (think
about hyphenation).

The positive side of doing this is that when you only have pixel information
(about half the diagrams we see) you have to reconstruct the characters by
OCR anyway. The result of this then merges with the character-based approach.
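As a rough illustration of the coordinate-based strategy above (a sketch
only, not AMI2's actual code): the Java below assumes each character arrives
with its x, y, width and font size, buckets the characters into lines by
baseline, and splits each line into phrases wherever the horizontal gap
exceeds a fraction of the font size. The Glyph and PhraseBuilder classes and
both tolerance values are hypothetical, and rotated or right-to-left text is
ignored.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** One positioned character (hypothetical type, not a PDFBox class). */
final class Glyph {
    final String text;
    final double x;        // left edge of the character
    final double y;        // baseline; PDF y grows upwards
    final double width;    // advance width
    final double fontSize;

    Glyph(String text, double x, double y, double width, double fontSize) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
        this.fontSize = fontSize;
    }
}

final class PhraseBuilder {
    // Assumed tolerances, as fractions of the font size; tune per document set.
    static final double LINE_TOLERANCE = 0.3;
    static final double GAP_TOLERANCE = 0.25;

    /** Groups glyphs into phrases, top line first, left to right. */
    static List<String> phrases(List<Glyph> glyphs) {
        // 1. Bucket glyphs into lines: sort by baseline, highest first,
        //    and start a new line when the baseline moves too far.
        List<Glyph> byY = new ArrayList<>(glyphs);
        byY.sort(Comparator.comparingDouble((Glyph g) -> g.y).reversed());
        List<List<Glyph>> lines = new ArrayList<>();
        for (Glyph g : byY) {
            List<Glyph> line = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (line == null
                    || Math.abs(line.get(0).y - g.y) > LINE_TOLERANCE * g.fontSize) {
                line = new ArrayList<>();
                lines.add(line);
            }
            line.add(g);
        }

        // 2. Within each line, order by x and split where the gap between
        //    one glyph's right edge and the next glyph's left edge is large.
        List<String> result = new ArrayList<>();
        for (List<Glyph> line : lines) {
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
            StringBuilder phrase = new StringBuilder();
            Glyph prev = null;
            for (Glyph g : line) {
                if (prev != null
                        && g.x - (prev.x + prev.width) > GAP_TOLERANCE * g.fontSize) {
                    result.add(phrase.toString());
                    phrase.setLength(0);
                }
                phrase.append(g.text);
                prev = g;
            }
            result.add(phrase.toString());
        }
        return result;
    }
}
```

With input like the case quoted below, where I, -, H, O, ... each arrive as
separate characters, adjacent glyphs on one baseline come back as a single
phrase, so the end of a reference such as I-HOIST-042 falls at the first
large gap.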

BTW if anyone has a good pointer to an Open Pure Java OCR tool we'd be
delighted, as I'm hacking my own (there are ancillary reasons for this).
Tesseract is not Pure Java, JavaOCR has become very complex and Lookup
doesn't seem to provide fonts. Currently we are hacking this from a few
high-quality sets of glyphs (such as Wikipedia entries). Maybe we should be
using the outline glyphs?

On Fri, Mar 7, 2014 at 4:55 AM, Maruan Sahyoun wrote:

> Hi,
>
> you could use PDFStreamEngine and override
> http://pdfbox.apache.org/docs/1.8.4/javadocs/org/apache/pdfbox/util/PDFStreamEngine.html#processTextPosition%28org.apache.pdfbox.util.TextPosition%29
>
> This gives you the position of all characters. You would then need to
> match/compare these to the string pattern you are looking for, accumulating
> the positions. After that you would have the area covered by the string,
> which you could use to e.g. overlay a button and/or link element.
>
> BR
> Maruan Sahyoun
>
> On 06.03.2014 at 21:30, Olaf Drümmer wrote:
>
> > You could use x and y position and rotation information to determine
> > whether two given characters - given their size - are relatively close to
> > each other or not and are on the same line.
> >
> > BT / ET is not at all guaranteed to give you strings as perceived by a
> > human.
> >
> > Olaf
> >
> >
> > On 6 Mar 2014 at 21:06, HQS wrote:
> >
> >> Well, thank you, sirs, for your quick responses.
> >>
> >> The PDFs are generated by Autodesk Inventor (even the latest version
> >> produces that kind of output).
> >>
> >> It is for one of my clients who wants an automatic transformation
> >> of some specific strings in the PDF into a clickable link.
> >>
> >> My problem is very simple: with such a structure I have no way to know
> >> when the string ends.
> >>
> >> As a matter of fact all the references to be transformed are prefixed
> >> with an 'I-' but there is no termination character, for instance:
> >> << I-HOIST-042 >>.
> >> Given that in the PDF I, -, H, O, (etc.), 2 are separate characters, I
> >> cannot rebuild the original string.
> >>
> >> I was hoping that there would be a block of text (BT ... ET) but, as I
> >> mentioned, each character is put in its own block...
> >>
> >> Regards,
> >>
> >>
> >> On 6 March 2014 at 18:57, Maruan Sahyoun wrote:
> >>
> >>> Hi Julien,
> >>>
> >>> for 1) that's possible and supported - how was the document generated?
> >>> DTP application?
> >>> for 2) PDFBox doesn't enforce a PDF version. In general it supports
> >>> all PDF files, but it doesn't have full coverage of all features defined
> >>> within certain PDF versions; it should have reasonable coverage. There
> >>> is no documentation on coverage yet, so I can't guarantee that a specific
> >>> feature is supported. Is there something special you are looking for?
> >>>
> >>> BR
> >>> Maruan Sahyoun
> >>>
> >>> On 06.03.2014 at 18:39, HQS wrote:
> >>>
> >>>> Hello all,
> >>>>
> >>>> 1.
> >>>> Have you ever seen PDFs having this kind of (pseudo) structure:
> >>>>
> >>>> BT
> >>>>
> >>>> Tj
> >>>> ET
> >>>>
> >>>> ?
> >>>>
> >>>> That is, the strings are split into characters and there is one
> >>>> block of text per character?
> >>>> It seems to be ill-formed, doesn't it?
> >>>>
> >>>> 2. As asked in my first mail: which PDF versions does the library
> >>>> comply with? 1.3 to 1.7?
> >>>>
> >>>>
> >>>> Thanks and regards
> >>>>
> >>>> Julien
> >>>>
> >>>
> >>
> >
>

-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069