pdfbox-users mailing list archives

From HQS <hqsoftwa...@gmail.com>
Subject Re: 2 questions
Date Fri, 07 Mar 2014 17:24:18 GMT
Thank you all for those accurate answers.
I will try the geometric approach based on the (x, y) coordinates of the characters.

Best regards,


On 7 March 2014 at 13:25, Peter Murray-Rust <pm286@cam.ac.uk> wrote:

> On Fri, Mar 7, 2014 at 11:16 AM, Confidential Confidential <
> hqsoftwares@gmail.com> wrote:
>> Sirs,
>> I had already thought about this graphical approach to reconstructing the
>> words. I dropped it because I'm a bit sceptical about the reliability of
>> such a method. I can't help thinking that it will not be 100% reliable.
>> I understand why CAD software would produce such output, though
>> (thank you for "boustrophedonic", a word I didn't know, but it explains
>> the result well).
> It's not as bad as you think. We have re-constructed the text from hundreds
> of scientific papers (so probably nearly a million words) and found very
> few problems. The reason we are doing this rather than using PDFBox tools
> is that scientific (and especially maths) PDFs contain many diacritics, high
> Unicode points, occasional graphics strokes, variable font size and style,
> ligatures, non-horizontal text, etc.
> For running text it works very well - assuming that the characters announce
> their widths. Then - roughly - "ab" is a word if
> x(a) + width(a)*fontSize(a) + tolerance >= x(b)
> else we can *crudely* estimate the number of intervening spaces (this is
> very suspect as publishers may elide concatenated spaces).
> All standard Fonts (see PDF spec) should announce their widths.
> Unfortunately scientific publishers use some of the worst constructed fonts
> in the world and sometimes we have to guess - by surveying a body of
> character positions and trying to work out spaces and font-type.
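
The spacing rule quoted above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code Peter's group uses: the `Glyph` record, the `group_words` function and the default tolerance are hypothetical names and values, and `width` is assumed to be expressed as a fraction of the font size so that `width * font_size` is in page units. The characters are assumed to come from a single line already.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    text: str
    x: float          # left edge, page units
    width: float      # advance width as a fraction of the font size (assumption)
    font_size: float  # font size, page units


def group_words(glyphs, tolerance=0.5):
    """Split the glyphs of one line into words.

    Two neighbours a, b stay in the same word when
    x(a) + width(a) * fontSize(a) + tolerance >= x(b).
    """
    words = []
    current = ""
    prev = None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev is not None:
            right_edge = prev.x + prev.width * prev.font_size
            if right_edge + tolerance < g.x:
                # Gap is wider than the tolerance: close the current word.
                words.append(current)
                current = ""
        current += g.text
        prev = g
    if current:
        words.append(current)
    return words
```

The tolerance would have to be tuned per document, and fonts that do not announce their widths need the guessing step described above.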
>> Supposing that the characters appear in a totally arbitrary order,
>> detecting that they're on the same line is more or less a piece of cake
>> (except if I need to introduce a tolerance, which makes things more
>> difficult),
> In a modern PDF we find that all characters on the same line tend to have
> equal y-coords to at least 3 decimals. The problem is that OCR'ed
> characters may have variable y because of rounding errors and antialiasing.
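
A minimal sketch of line detection with a y tolerance, assuming each character arrives as a hypothetical `(text, x, y)` tuple in arbitrary order and that y increases up the page, as in PDF user space. The function name `group_lines` and the default tolerance of 0.001 (matching the "3 decimals" remark) are illustrative choices; OCR'ed input would need a much larger value.

```python
def group_lines(chars, tolerance=0.001):
    """Group (text, x, y) tuples into lines, top of page first.

    A character joins the current line when its y differs from the
    line's first character by at most `tolerance`.
    """
    lines = []
    # Highest y first, so lines come out in reading order.
    for text, x, y in sorted(chars, key=lambda c: -c[2]):
        if lines and abs(lines[-1]["y"] - y) <= tolerance:
            lines[-1]["chars"].append((x, text))
        else:
            lines.append({"y": y, "chars": [(x, text)]})
    # Within each line, order the characters left to right.
    return ["".join(t for _, t in sorted(line["chars"])) for line in lines]
```

Comparing against the first character of the line, rather than the previous one, keeps a slowly drifting baseline from merging two neighbouring lines.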
>> but grouping the characters according to their X position is
>> not at all an easy task.
> The order should be fairly clear. The problems are:
> * spaces (see above)
> * hyphens at line-end (this requires heuristics - maybe lookup in Wordnet)
> - we generally solve > 90%. Hyphens in chemistry are meaningful
> * diacritics. Some characters have diacritics with the same x (e.g. E and
> acute). These can occur in variable order. Where possible we try to
> recreate a single Unicode point.
> * over and underbars
> * ligatures: "waffle" may be 6 characters or only 4 (w-a-ffl-e). We
> split the latter.
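
Two of the points in the list above (diacritics and ligatures) can be illustrated with Python's standard `unicodedata` module. This is a sketch, not the method described in the message: `compose_pair` and `split_ligatures` are hypothetical helper names, and the diacritic is assumed to be extracted as a Unicode combining mark (a spacing accent such as U+00B4 would first need mapping to its combining form).

```python
import unicodedata


def compose_pair(first, second):
    """Merge a base character and a combining mark given in either order."""
    if unicodedata.combining(first) and not unicodedata.combining(second):
        # The mark was emitted before its base: swap them.
        first, second = second, first
    # NFC composes base + mark into a single code point where one exists.
    return unicodedata.normalize("NFC", first + second)


def split_ligatures(text):
    """Expand the Latin ligatures U+FB00..U+FB06 into plain letters."""
    out = []
    for ch in text:
        if "\ufb00" <= ch <= "\ufb06":
            # Compatibility decomposition, e.g. U+FB04 becomes "ffl".
            out.append(unicodedata.normalize("NFKD", ch))
        else:
            out.append(ch)
    return "".join(out)
```

For example, `split_ligatures("wa\ufb04e")` yields the six-character "waffle", and `compose_pair("E", "\u0301")` yields the single code point U+00C9. Restricting the expansion to the ligature block leaves other characters, such as superscripts, untouched.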
>> But this is not the issue; my problem is rather that this method may
>> not be 100% reliable. What do you think?
> We are committed to solving it for English-language science and European
> personal names. The worst case is probably slanted text in diagrams.
>> As for the technical part (overloading the processText), it's ok, thanks
>> for the advice.
>> Best regards
>> Julien
>> --
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
