pdfbox-users mailing list archives

From Peter Murray-Rust <pm...@cam.ac.uk>
Subject Re: 2 questions
Date Sat, 08 Mar 2014 14:14:10 GMT
The width appears to be a ratio, independent of size. It also seems to be
conventionally multiplied by 1000 (I have not found a definition for this -
I have only guessed it).

Thus a character "A" of width=600 and fontSize=10.5 appears to have
pixelwidth = 600. / 1000. * 10.5 = 6.3 pixels

I'd be grateful for confirmation or correction...
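
For reference, the arithmetic above can be written as a small sketch. It assumes, as guessed above, that widths are stated in 1/1000 of the font size; this is consistent with the glyph-space convention in the PDF specification for most font types (Type 3 fonts define their own matrix). The function name is made up for illustration, and the result is in whatever unit fontSize is expressed in.

```python
def glyph_advance(width_units: float, font_size: float) -> float:
    """Horizontal advance of one glyph, in the same unit as font_size.

    Assumption: width_units is expressed in 1/1000 of the font size.
    """
    return width_units / 1000.0 * font_size


# The example from this message: width=600, fontSize=10.5
print(round(glyph_advance(600, 10.5), 3))  # 6.3
```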


On Sat, Mar 8, 2014 at 11:12 AM, HQS <hqsoftwares@gmail.com> wrote:

> Well, I would like to ask Peter for a clarification about this formula:
>
> x(a) + width(a)*fontSize(a) + tolerance >= x(b)
>
> What is the difference between « width(a) » and « fontSize(a) »? Isn't it
> enough to know the width of the character « a » in pixels, as given by the
> font, to check this assertion?
>
> Thanks !
>
>
> On 7 March 2014 at 18:46, Maruan Sahyoun <sahyoun@fileaffairs.de> wrote:
>
> > if you need further assistance please let us know.
> >
> > BR
> > Maruan Sahyoun
> >
> > On 07.03.2014 at 18:24, HQS <hqsoftwares@gmail.com> wrote:
> >
> >> Thank you all for those accurate answers.
> >> I will try the geometrical approach based on the (x, y)
> >> coordinates of the characters.
> >>
> >> Best regards,
> >>
> >> Julien
> >>
> >> On 7 March 2014 at 13:25, Peter Murray-Rust <pm286@cam.ac.uk> wrote:
> >>
> >>> On Fri, Mar 7, 2014 at 11:16 AM, Confidential Confidential <
> >>> hqsoftwares@gmail.com> wrote:
> >>>
> >>>> Sirs,
> >>>>
> >>>> I had already thought about this graphical approach to reconstructing
> >>>> the words. I gave up on it because I'm a bit sceptical about the
> >>>> reliability of such a method. I can't help thinking that it will not
> >>>> be 100% reliable. I understand why CAD software would produce such
> >>>> output, though (thank you for "boustrophedonic", a word I didn't know,
> >>>> which explains the result well).
> >>>>
> >>>
> >>> It's not as bad as you think. We have reconstructed the text from
> >>> hundreds of scientific papers (so probably nearly a million words) and
> >>> found very few problems. The reason we are doing this rather than using
> >>> PDFBox tools is that scientific (and especially maths) PDFs contain many
> >>> diacritics, high Unicode points, occasional graphics strokes, variable
> >>> font size and style, ligatures, non-horizontal text, etc.
> >>>
> >>> For running text it works very well - assuming that the characters
> >>> announce their widths. Then - roughly - "ab" is a word if
> >>>
> >>> x(a) + width(a)*fontSize(a) + tolerance >= x(b)
> >>>
> >>> else we can *crudely* estimate the number of intervening spaces (this
> >>> is very suspect as publishers may elide concatenated spaces).
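
The test above, together with the crude space estimate, might look like the following sketch. It is not the code used in the project described here. The `Glyph` class, the default tolerance and the nominal space width of 250 units are all assumptions made for illustration, and widths are assumed to be in 1/1000 of the font size.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float          # left edge of the glyph
    width: float      # announced width, assumed to be in 1/1000 of font size
    font_size: float


def advance(g: Glyph) -> float:
    """Scaled width of the glyph, in the same unit as x."""
    return g.width / 1000.0 * g.font_size


def same_word(a: Glyph, b: Glyph, tolerance: float = 0.5) -> bool:
    """x(a) + scaled width(a) + tolerance >= x(b)."""
    return a.x + advance(a) + tolerance >= b.x


def estimate_spaces(a: Glyph, b: Glyph, space_width: float = 250.0) -> int:
    """Crude guess at how many spaces fit in the gap between a and b."""
    gap = b.x - (a.x + advance(a))
    one_space = space_width / 1000.0 * a.font_size
    return max(0, round(gap / one_space))


def join_line(glyphs: list[Glyph], tolerance: float = 0.5) -> str:
    """Join glyphs of one line, putting a single space between words."""
    ordered = sorted(glyphs, key=lambda g: g.x)
    out = []
    for prev, cur in zip([None] + ordered[:-1], ordered):
        if prev is not None and not same_word(prev, cur, tolerance):
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


line = [Glyph("a", 0.0, 500, 10), Glyph("b", 5.2, 500, 10), Glyph("c", 13.0, 500, 10)]
print(join_line(line))  # ab c
```

The tolerance and the space width would need tuning per font; the numbers here are placeholders.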
> >>>
> >>> All standard fonts (see the PDF spec) should announce their widths.
> >>> Unfortunately scientific publishers use some of the worst-constructed
> >>> fonts in the world and sometimes we have to guess - by surveying a body
> >>> of character positions and trying to work out spaces and font type.
> >>>
> >>>
> >>>> Supposing that the characters appear in a totally arbitrary order,
> >>>> detecting that they're on the same line is more or less a piece of cake
> >>>> (except if I need to introduce a tolerance, which makes things more
> >>>> difficult),
> >>>
> >>>
> >>> In a modern PDF we find that all characters on the same line tend to
> >>> have equal y-coords to at least 3 decimals. The problem is that OCR'ed
> >>> characters may have variable y because of rounding errors and
> >>> antialiasing.
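
A minimal sketch of grouping characters into lines by y-coordinate, assuming each character is a (char, x, y) tuple and that y increases up the page, as in default PDF user space. The function name and the default tolerance of 0.001 (roughly the "3 decimals" mentioned above) are illustrative assumptions; OCR'ed input would need a much larger tolerance.

```python
def group_into_lines(glyphs, y_tolerance=0.001):
    """Cluster (char, x, y) tuples into lines, top line first.

    A glyph joins the current line when its y is within y_tolerance of the
    glyph most recently added to that line; otherwise it starts a new line.
    """
    lines = []
    for g in sorted(glyphs, key=lambda g: g[2], reverse=True):
        if lines and abs(lines[-1][-1][2] - g[2]) <= y_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])
    # Within each line, order the glyphs left to right
    return [sorted(line, key=lambda g: g[1]) for line in lines]


glyphs = [("b", 10.0, 700.0), ("c", 0.0, 688.0), ("a", 0.0, 700.0004)]
for line in group_into_lines(glyphs):
    print("".join(g[0] for g in line))
```

Because each glyph is compared with the previous one rather than with a fixed baseline, a slow drift in y along a line is tolerated, at the cost of possibly merging lines if the tolerance is set too generously.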
> >>>
> >>>
> >>>
> >>>> but grouping the characters according to their X position is
> >>>> not at all an easy task.
> >>>>
> >>>
> >>> The order should be fairly clear. The problems are:
> >>> * spaces (see above)
> >>> * hyphens at line-end (this requires heuristics - maybe lookup in
> >>> WordNet) - we generally solve > 90%. Hyphens in chemistry are meaningful
> >>> * diacritics. Some characters have diacritics with the same x (e.g. E
> >>> and acute). These can occur in variable order. Where possible we try to
> >>> recreate a single Unicode point.
> >>> * over- and underbars
> >>> * ligatures: in "waffle" there may be 6 characters or only 4
> >>> (w-a-ffl-e). We split the latter.
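
Two of the items above (diacritics and ligatures) can be sketched with Python's standard `unicodedata` module. This is only an illustration under simplifying assumptions, not the method used in the work described here: it assumes the diacritic arrives as a Unicode combining mark (PDFs often deliver a spacing accent instead, which would need mapping first), and the helper names are made up.

```python
import unicodedata


def compose_pair(c1: str, c2: str) -> str:
    """Merge a base letter and a combining mark, given in either order,
    into a single code point where Unicode defines one."""
    # A nonzero combining class marks a combining character; put it second
    if unicodedata.combining(c1) and not unicodedata.combining(c2):
        c1, c2 = c2, c1
    return unicodedata.normalize("NFC", c1 + c2)


def split_ligatures(text: str) -> str:
    """Expand ligature code points such as U+FB04 into their letters."""
    out = []
    for ch in text:
        if "LIGATURE" in unicodedata.name(ch, ""):
            # Compatibility decomposition; leaves ch as-is if it has none
            out.append(unicodedata.normalize("NFKD", ch))
        else:
            out.append(ch)
    return "".join(out)


print(compose_pair("E", "\u0301") == "\u00c9")    # E + combining acute
print(split_ligatures("wa\ufb04e"))               # waffle
```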
> >>>
> >>>
> >>>>
> >>>> But this is not an issue; my problem is more the fact that this
> >>>> method may not be 100% reliable. What do you think?
> >>>>
> >>>
> >>> We are committed to solving it for English-language science and
> >>> European personal names. The worst case is probably slanted text in
> >>> diagrams.
> >>>
> >>>
> >>>>
> >>>> As for the technical part (overloading the processText), it's ok,
> >>>> thanks for the advice.
> >>>>
> >>>> Best regards
> >>>>
> >>>> Julien
> >>>>
> >>>>
> >>>>
> >>>> --
> >>> Peter Murray-Rust
> >>> Reader in Molecular Informatics
> >>> Unilever Centre, Dep. Of Chemistry
> >>> University of Cambridge
> >>> CB2 1EW, UK
> >>> +44-1223-763069
> >>
> >
>
>


-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
