Subject: Re: 2 questions
From: HQS
Date: Fri, 7 Mar 2014 18:24:18 +0100
To: users@pdfbox.apache.org

Thank you all for these precise answers. I will try the geometrical
approach based on the (x, y) coordinates of the characters.

Best regards,

Julien

On 7 March 2014 at 13:25, Peter Murray-Rust wrote:

> On Fri, Mar 7, 2014 at 11:16 AM, Confidential Confidential
> <hqsoftwares@gmail.com> wrote:
>
>> Sirs,
>>
>> I had already thought about this graphical approach to reconstruct
>> the words. I dropped it because I'm a bit sceptical about the
>> reliability of such a method. I can't help thinking that it will not
>> be 100% reliable. I understand why CAD software would produce such
>> output, though (thank you for the new word, "boustrophedonic", which
>> I didn't know; it explains the result I obtained well).
>
> It's not as bad as you think. We have reconstructed the text from
> hundreds of scientific papers (so probably nearly a million words)
> and found very few problems.
> The reason we are doing this rather than using PDFBox tools is that
> scientific (and especially maths) PDFs contain many diacritics, high
> Unicode points, occasional graphics strokes, variable font size and
> style, ligatures, non-horizontal text, etc.
>
> For running text it works very well - assuming that the characters
> announce their widths. Then - roughly - "ab" is a word if
>
>     x(a) + width(a)*fontSize(a) + tolerance >= x(b)
>
> else we can *crudely* estimate the number of intervening spaces (this
> is very suspect as publishers may elide concatenated spaces).
>
> All standard Fonts (see PDF spec) should announce their widths.
> Unfortunately scientific publishers use some of the worst constructed
> fonts in the world and sometimes we have to guess - by surveying a
> body of character positions and trying to work out spaces and
> font-type.
>
>> Supposing that the characters appear in a totally arbitrary order,
>> detecting that they're on the same line is more or less a piece of
>> cake (except if I need to introduce a tolerance, which makes things
>> more difficult),
>
> In a modern PDF we find that all characters on the same line tend to
> have equal y-coords to at least 3 decimals. The problem is that
> OCR'ed characters may have variable y because of rounding errors and
> antialiasing.
>
>> but grouping the characters according to their X position is
>> not at all an easy task.
>
> The order should be fairly clear. The problems are:
> * spaces (see above)
> * hyphens at line-end (this requires heuristics - maybe lookup in
>   Wordnet) - we generally solve more than 90%. Hyphens in chemistry
>   are meaningful
> * diacritics. Some characters have diacritics with the same x (e.g. E
>   and acute). These can occur in variable order. Where possible we
>   try to recreate a single Unicode point.
> * over and underbars
> * ligatures (in "waffle") there may be 6 characters or only 4
>   w-a-ffl-e.
>   We split the latter.
>
>> But this is not an issue, my problem is more the fact that this
>> method may not be 100% reliable. What do you think?
>
> We are committed to solving it for English-language science and
> European personal names. The worst case is probably slanted text in
> diagrams.
>
>> As for the technical part (overloading the processText), it's ok,
>> thanks for the advice.
>>
>> Best regards
>>
>> Julien
>
> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
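
For reference, here is a minimal sketch of the geometric grouping
described in the quoted message: glyphs are first bucketed into lines by
y-coordinate, then split into words with the rule
x(a) + width(a)*fontSize(a) + tolerance >= x(b). It is written in Python
and is not tied to PDFBox. The Glyph class, the function names and both
tolerance defaults are assumptions made for illustration only, and the
sketch ignores the harder cases listed above (line-end hyphens,
diacritics, ligatures, slanted text).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    """One positioned character (hypothetical stand-in for extractor output)."""
    char: str
    x: float          # left edge of the glyph
    y: float          # baseline
    width: float      # advance width per unit of font size
    font_size: float


def group_lines(glyphs: List[Glyph], y_tol: float = 0.001) -> List[List[Glyph]]:
    """Bucket glyphs whose y agrees within y_tol; sort each bucket by x.

    Lines come back in ascending y. Whether that is top-to-bottom depends
    on the coordinate convention of the extractor.
    """
    lines: List[List[Glyph]] = []
    for g in sorted(glyphs, key=lambda g: (g.y, g.x)):
        if lines and abs(lines[-1][0].y - g.y) <= y_tol:
            lines[-1].append(g)
        else:
            lines.append([g])
    return [sorted(line, key=lambda g: g.x) for line in lines]


def split_words(line: List[Glyph], x_tol: float = 0.5) -> List[str]:
    """Split an x-sorted line into words.

    a and b stay in one word if x(a) + width(a)*fontSize(a) + x_tol >= x(b);
    otherwise a word break is assumed between them.
    """
    words: List[str] = []
    current = ""
    prev = None
    for g in line:
        if prev is not None and prev.x + prev.width * prev.font_size + x_tol < g.x:
            words.append(current)
            current = ""
        current += g.char
        prev = g
    if current:
        words.append(current)
    return words


# Glyphs supplied in arbitrary order: "ab cd" on one line, "ef" on another.
glyphs = [
    Glyph("d", 18, 100, 0.5, 10), Glyph("f", 5, 88, 0.5, 10),
    Glyph("a", 0, 100, 0.5, 10), Glyph("c", 13, 100, 0.5, 10),
    Glyph("e", 0, 88, 0.5, 10), Glyph("b", 5, 100, 0.5, 10),
]
result = [split_words(line) for line in group_lines(glyphs)]
print(result)
```

With OCR'ed input, y_tol would likely need to be much larger than 0.001,
and x_tol probably has to be tuned per font when the widths are
unreliable.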