pdfbox-users mailing list archives

From Maruan Sahyoun <sahy...@fileaffairs.de>
Subject Re: 2 questions
Date Fri, 07 Mar 2014 17:46:32 GMT
If you need further assistance, please let us know.

BR
Maruan Sahyoun

On 07.03.2014 at 18:24, HQS <hqsoftwares@gmail.com> wrote:

> Thank you all for those accurate answers.
> I will try the geometrical approach based on the (x, y) coordinates of the
> characters.
> 
> Best regards,
> 
> Julien
> 
> On 7 March 2014 at 13:25, Peter Murray-Rust <pm286@cam.ac.uk> wrote:
> 
>> On Fri, Mar 7, 2014 at 11:16 AM, Confidential Confidential <
>> hqsoftwares@gmail.com> wrote:
>> 
>>> Sirs,
>>> 
>>> I had already thought about this graphical approach to reconstructing the
>>> words. I dropped it because I'm a bit sceptical about the reliability of
>>> such a method. I can't help thinking that it will not be 100% reliable.
>>> I understand why CAD software would produce such output, though (thank
>>> you for "boustrophedonic", a word I didn't know, but it explains the
>>> result well).
>>> 
>> 
>> It's not as bad as you think. We have re-constructed the text from hundreds
>> of scientific papers (so probably nearly a million words) and found very
>> few problems. The reason we are doing this rather than using PDFBox tools
>> is that scientific (and especially maths) PDFs contain many diacritics, high
>> Unicode points, occasional graphics strokes, variable font size and style,
>> ligatures, non-horizontal text, etc.
>> 
>> For running text it works very well - assuming that the characters announce
>> their widths. Then - roughly - "ab" is a word if
>> 
>> x(a) + width(a)*fontSize(a) + tolerance >= x(b)
>> 
>> else we can *crudely* estimate the number of intervening spaces (this is
>> very suspect as publishers may elide concatenated spaces).
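The word rule quoted above can be sketched in a few lines. This is only an illustration, not code from PDFBox or from Peter's project: the `Glyph` record, the `group_into_words` helper and the default tolerance are hypothetical names and values, and the glyphs are assumed to lie on a single horizontal line with widths expressed per unit of font size.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    # Hypothetical record for one extracted character.
    text: str
    x: float          # left edge, in page units
    width: float      # announced width, per unit of font size
    font_size: float


def group_into_words(glyphs, tolerance=0.5):
    """Join glyphs of one line into words.

    Glyphs a, b stay in the same word when
    x(a) + width(a) * fontSize(a) + tolerance >= x(b).
    """
    words = []
    current = ""
    prev = None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev is not None:
            right_edge = prev.x + prev.width * prev.font_size
            if right_edge + tolerance < g.x:
                # Gap is too wide: close the current word.
                words.append(current)
                current = ""
        current += g.text
        prev = g
    if current:
        words.append(current)
    return words


line = [
    Glyph("c", 13.0, 0.5, 10.0),
    Glyph("a", 0.0, 0.5, 10.0),
    Glyph("d", 18.0, 0.5, 10.0),
    Glyph("b", 5.0, 0.5, 10.0),
]
words = group_into_words(line)
```

The tolerance has to be tuned per document; a value that is too large merges words, and one that is too small splits them at kerning gaps.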
>> 
>> All standard Fonts (see PDF spec) should announce their widths.
>> Unfortunately scientific publishers use some of the worst constructed fonts
>> in the world and sometimes we have to guess - by surveying a body of
>> character positions and trying to work out spaces and font-type.
>> 
>> 
>>> Supposing that the characters appear in a totally arbitrary order,
>>> detecting that they're on the same line is more or less a piece of cake
>>> (except if I need to introduce a tolerance, which makes things more
>>> difficult),
>> 
>> 
>> In a modern PDF we find that all characters on the same line tend to have
>> equal y-coords to at least 3 decimals. The problem is that OCR'ed
>> characters may have variable y because of rounding errors and antialiasing.
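Grouping characters into lines by y-coordinate might look like the sketch below. The `(text, x, y)` tuple layout, the `group_into_lines` name and the default tolerance are assumptions made for illustration; for OCR'ed input the tolerance would need to be much larger than the three decimals mentioned above.

```python
def group_into_lines(chars, y_tolerance=0.001):
    """Group (text, x, y) tuples into lines of near-equal y.

    Returns a list of lines ordered by increasing y; each line is
    sorted by x. In PDF user space y usually grows upwards, so reverse
    the result for top-to-bottom reading order.
    """
    lines = []
    for ch in sorted(chars, key=lambda c: c[2]):
        # Compare with the last character added to the current line.
        if lines and abs(ch[2] - lines[-1][-1][2]) <= y_tolerance:
            lines[-1].append(ch)
        else:
            lines.append([ch])
    return [sorted(line, key=lambda c: c[1]) for line in lines]


chars = [("b", 10.0, 700.0002), ("a", 0.0, 700.0), ("c", 0.0, 688.0)]
lines = group_into_lines(chars)
```

Because each character is compared with its predecessor, a long run of slightly drifting y values can chain into one line; comparing against the first character of the line instead is a stricter alternative.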
>> 
>> 
>> 
>>> but grouping the characters according to their X position is
>>> not at all an easy task.
>>> 
>> 
>> The order should be fairly clear. The problems are:
>> * spaces (see above)
>> * hyphens at line-end (this requires heuristics, maybe a lookup in WordNet);
>>   we generally solve over 90% of cases. Hyphens in chemistry are meaningful.
>> * diacritics. Some characters have diacritics with the same x (e.g. E and
>> acute). These can occur in variable order. Where possible we try to
>> recreate a single Unicode point.
>> * over and underbars
>> * ligatures (in "waffle" there may be 6 characters or only 4: w-a-ffl-e). We
>> split the latter.
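For the diacritic and ligature items in the list above, Unicode normalization covers the simple cases. The sketch below uses Python's standard `unicodedata` module; the function names are hypothetical, and it assumes the accent has already been converted to a combining mark placed after its base letter, which the geometric step would have to arrange first.

```python
import unicodedata


def compose_diacritics(text):
    """Merge base letter + combining mark into a single code point (NFC)."""
    return unicodedata.normalize("NFC", text)


def split_ligatures(text):
    """Expand the Latin ligatures U+FB00..U+FB06 into plain letters."""
    out = []
    for ch in text:
        if 0xFB00 <= ord(ch) <= 0xFB06:
            # Compatibility decomposition, e.g. U+FB04 -> "ffl".
            out.append(unicodedata.normalize("NFKD", ch))
        else:
            out.append(ch)
    return "".join(out)


word = split_ligatures("wa\ufb04e")        # 4 code points in, 6 letters out
letter = compose_diacritics("E\u0301")     # E + combining acute accent
```

Restricting the expansion to the ligature range avoids the wider side effects of applying NFKD to a whole string, which would also rewrite superscripts and similar characters that matter in scientific text.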
>> 
>> 
>>> 
>>> But this is not an issue, my problem is more the fact that this method may
>>> not be 100% reliable. What do you think?
>>> 
>> 
>> We are committed to solving it for English-language science and European
>> personal names. The worst case is probably slanted text in diagrams.
>> 
>> 
>>> 
>>> As for the technical part (overloading the processText), it's ok, thanks
>>> for the advice.
>>> 
>>> Best regards
>>> 
>>> Julien
>>> 
>>> 
>>> 
>>> --
>> Peter Murray-Rust
>> Reader in Molecular Informatics
>> Unilever Centre, Dep. Of Chemistry
>> University of Cambridge
>> CB2 1EW, UK
>> +44-1223-763069
> 

