Subject: Re: 2 questions
From: HQS <hqsoftwares@gmail.com>
Date: Fri, 7 Mar 2014 18:24:18 +0100
To: users@pdfbox.apache.org

Thank you all for those accurate answers.
I will try the geometrical approach based on the (x, y)
coordinates of the characters.

Best regards,

Julien

On 7 March 2014 at 13:25, Peter Murray-Rust <pm286@cam.ac.uk> wrote:

> On Fri, Mar 7, 2014 at 11:16 AM, Confidential Confidential <
> hqsoftwares@gmail.com> wrote:
>
>> Sirs,
>>
>> I had already thought about this graphical approach to reconstruct the
>> words. I gave up on it because I'm a bit sceptical about the reliability of
>> such a method. I can't help thinking that it will not be a 100% sure
>> method. I understand why CAD software would produce such output,
>> though (thank you for the new word, "boustrophedonic"; it explains
>> the result well).
>>
>
> It's not as bad as you think. We have reconstructed the text from hundreds
> of scientific papers (so probably nearly a million words) and found very
> few problems. The reason we are doing this rather than using PDFBox tools
> is that scientific (and especially maths) PDFs contain many diacritics, high
> Unicode points, occasional graphics strokes, variable font size and style,
> ligatures, non-horizontal text, etc.
>
> For running text it works very well - assuming that the characters announce
> their widths. Then - roughly - "ab" is a word if
>
> x(a) + width(a)*fontSize(a) + tolerance >= x(b)
>
> else we can *crudely* estimate the number of intervening spaces (this is
> very suspect as publishers may elide concatenated spaces).
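
A minimal Python sketch of the word test above, for illustration only. It assumes each character has already been extracted with its left x position, an advance width expressed as a fraction of the font size, and the font size itself. The names Glyph, same_word, group_words and estimate_spaces, and the default tolerance and space width, are hypothetical and not part of PDFBox.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float          # left edge of the character, in page units
    width: float      # advance width as a fraction of the font size (assumed)
    font_size: float  # font size, in page units


def same_word(a, b, tolerance=0.5):
    """Apply the test x(a) + width(a)*fontSize(a) + tolerance >= x(b)."""
    return a.x + a.width * a.font_size + tolerance >= b.x


def estimate_spaces(a, b, space_width=0.25):
    """Crudely estimate how many spaces fit in the gap between a and b.

    space_width is an assumed width of a space, as a fraction of font size.
    """
    gap = b.x - (a.x + a.width * a.font_size)
    return max(0, round(gap / (space_width * a.font_size)))


def group_words(glyphs, tolerance=0.5):
    """Sort the characters of one line by x and split them into words."""
    words, current = [], []
    for g in sorted(glyphs, key=lambda g: g.x):
        if current and not same_word(current[-1], g, tolerance):
            words.append("".join(c.char for c in current))
            current = []
        current.append(g)
    if current:
        words.append("".join(c.char for c in current))
    return words


line = [Glyph("c", 13, 0.5, 10), Glyph("a", 0, 0.5, 10),
        Glyph("d", 18, 0.5, 10), Glyph("b", 5, 0.5, 10)]
print(group_words(line))
```

The tolerance has to be tuned per document; as noted above, the space count is only a rough guess when the font does not report reliable widths.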
>
> All standard fonts (see PDF spec) should announce their widths.
> Unfortunately scientific publishers use some of the worst constructed fonts
> in the world and sometimes we have to guess - by surveying a body of
> character positions and trying to work out spaces and font-type.
>
>
>> Supposing that the characters appear in a totally arbitrary order,
>> detecting that they're on the same line is more or less a piece of cake
>> (except if I need to introduce a tolerance, which makes things more
>> difficult),
>
>
> In a modern PDF we find that all characters on the same line tend to have
> equal y-coords to at least 3 decimals. The problem is that OCR'ed
> characters may have variable y because of rounding errors and antialiasing.
>
>
>
>> but grouping the characters according to their X position is
>> not at all an easy task.
>>
>
> The order should be fairly clear. The problems are:
> * spaces (see above)
> * hyphens at line-end (this requires heuristics - maybe lookup in Wordnet);
>   we generally solve > 90%. Hyphens in chemistry are meaningful
> * diacritics. Some characters have diacritics with the same x (e.g. E and
>   acute). These can occur in variable order. Where possible we try to
>   recreate a single Unicode point.
> * over and underbars
> * ligatures: in "waffle" there may be 6 characters or only 4 (w-a-ffl-e). We
>   split the latter.
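
Three of the items in the list above (line-end hyphens, diacritics, ligatures) can be illustrated with Python's standard unicodedata module. This is a sketch under assumptions, not the method actually used by Peter's group: the lexicon is a plain set standing in for a Wordnet lookup, the table mapping spacing accents to combining marks is a small hand-picked example, and all function names are hypothetical.

```python
import unicodedata

# Assumed mapping from spacing accents (as they may appear in a PDF)
# to the corresponding Unicode combining marks.
SPACING_TO_COMBINING = {
    "\u00b4": "\u0301",  # acute
    "`": "\u0300",       # grave
    "\u00a8": "\u0308",  # diaeresis
    "^": "\u0302",       # circumflex
    "~": "\u0303",       # tilde
}


def split_ligatures(text):
    """Expand the Latin ligatures U+FB00..U+FB06 (ff, fi, fl, ffi, ffl, ...)."""
    return "".join(
        unicodedata.normalize("NFKD", ch) if "\ufb00" <= ch <= "\ufb06" else ch
        for ch in text
    )


def compose_pair(first, second):
    """Merge a base letter and a diacritic, given in either order,
    into a single code point where Unicode defines one."""
    a = SPACING_TO_COMBINING.get(first, first)
    b = SPACING_TO_COMBINING.get(second, second)
    if unicodedata.combining(a) and not unicodedata.combining(b):
        a, b = b, a  # put the base letter first
    return unicodedata.normalize("NFC", a + b)


def dehyphenate(left, right, lexicon):
    """Join a word split at line end; drop the hyphen only if the
    joined form is a known word, otherwise keep the hyphen."""
    joined = left.rstrip("-") + right
    return joined if joined.lower() in lexicon else left + right


print(split_ligatures("wa\ufb04e"))
print(compose_pair("\u00b4", "E"))
print(dehyphenate("recon-", "structed", {"reconstructed"}))
```

Keeping the hyphen when the lookup fails is a deliberately conservative default, since hyphens in chemical names carry meaning.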
>
>
>>
>> But this is not an issue, my problem is more the fact that this method may
>> not be 100% reliable. What do you think?
>>
>
> We are committed to solving it for English-language science and European
> personal names. The worst case is probably slanted text in diagrams.
>
>
>>
>> As for the technical part (overloading the processText), it's ok, thanks
>> for the advice.
>>
>> Best regards
>>
>> Julien
>>
>>
>>
>> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069