pdfbox-users mailing list archives

From Maruan Sahyoun <sahy...@fileaffairs.de>
Subject Re: 2 questions
Date Sun, 09 Mar 2014 11:26:00 GMT
The factor of 1000 is defined in the PDF specification; it maps from glyph space to text
space. You may want to take a look at sections 9.1 to 9.4 of the ISO 32000 spec.
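
As a rough illustration of that mapping, here is a minimal Java sketch of the arithmetic discussed further down the thread (width 600, font size 10.5). The class and method names are hypothetical and not part of PDFBox. It assumes a font whose glyph space is 1/1000 of a text space unit (the spec treats Type 3 fonts differently) and it ignores character spacing, word spacing and horizontal scaling. The result is in unscaled text space units rather than device pixels.

```java
// Hypothetical helper, not a PDFBox API: converts a glyph-space width
// into unscaled text space using the 1/1000 convention.
final class GlyphWidth {
    // Assumption: 1000 glyph-space units per text-space unit (not valid for Type 3 fonts).
    static final float GLYPH_UNITS_PER_TEXT_UNIT = 1000f;

    /** Returns the horizontal advance of one glyph at the given font size. */
    static float advance(float glyphWidth, float fontSize) {
        return glyphWidth / GLYPH_UNITS_PER_TEXT_UNIT * fontSize;
    }

    public static void main(String[] args) {
        // The example from the thread: width 600 at font size 10.5.
        System.out.println(advance(600f, 10.5f));
    }
}
```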

BR
Maruan Sahyoun

On 08.03.2014 at 18:23, HQS <hqsoftwares@gmail.com> wrote:

> Peter,
> 
> What you said about the factor of 1000 matches what I have seen on a website dealing
> with PDFBox, so you may well be right.
> I have tried the following assertion which, if true, means that 2 characters belong to
> the same word:
> 
> leftChar.getX() + leftChar.getWidth() + space * .5f + X_TOLERANCE >= rightChar.getX()
> 
> I tried with X_TOLERANCE = 0
> 
> space is simply leftChar.getWidthOfSpace(), a method of the TextPosition class.
> getWidth() is also a method of that class.
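
A minimal sketch of how the assertion above could be used to cut one line of characters into words. The Glyph class is a hypothetical stand-in for the values read from TextPosition (getX(), getWidth(), getWidthOfSpace()), so the example runs without PDFBox. It assumes the glyphs are already on the same line and sorted from left to right.

```java
import java.util.ArrayList;
import java.util.List;

final class WordJoiner {
    /** Hypothetical stand-in for the values taken from a TextPosition. */
    static final class Glyph {
        final String text;
        final float x, width, widthOfSpace;

        Glyph(String text, float x, float width, float widthOfSpace) {
            this.text = text;
            this.x = x;
            this.width = width;
            this.widthOfSpace = widthOfSpace;
        }
    }

    /** The assertion from the message: true if both glyphs belong to the same word. */
    static boolean sameWord(Glyph left, Glyph right, float xTolerance) {
        return left.x + left.width + left.widthOfSpace * 0.5f + xTolerance >= right.x;
    }

    /** Splits one line of glyphs (assumed sorted by x) into words. */
    static List<String> words(List<Glyph> line, float xTolerance) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : line) {
            // Start a new word when the gap to the previous glyph is too wide.
            if (previous != null && !sameWord(previous, g, xTolerance)) {
                out.add(current.toString());
                current.setLength(0);
            }
            current.append(g.text);
            previous = g;
        }
        if (current.length() > 0) {
            out.add(current.toString());
        }
        return out;
    }
}
```

With X_TOLERANCE = 0, any gap wider than half a space width starts a new word. Raising the tolerance merges more glyphs into the same word.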
> 
> The first results are very satisfying.
> 
> By the way, is there an « easy » way to delete text from a PDF, apart from parsing
> the tokens and deleting those preceding the « Tj » / « TJ » operators? I need this to
> erase the reference strings that I have detected and create a hyperlink at the same
> location with the same font.
> 
> Once I have tested the PDF word extractor I will post the source code so that we can
> share our techniques.
> The extractor I'm making is a bit more advanced than the one embedded in PDFBox, as
> it creates a list of pairs (XY position of a word, contents of a word) rather than
> just the list of words.
> 
> Thanks all !
> 
> Julien
> 
> 
> On 8 March 2014 at 15:14, Peter Murray-Rust <pm286@cam.ac.uk> wrote:
> 
>> The width appears to be a ratio, independent of size. It also seems to be
>> conventionally multiplied by 1000 (I have not found a definition for this -
>> I have only guessed it).
>> 
>> Thus a character "A" of width=600 and fontSize=10.5 appears to have
>> pixelwidth = 600. / 1000. * 10.5 = 6.3 pixels
>> 
>> I'd be grateful for confirmation or correction...
>> 
>> 
>> On Sat, Mar 8, 2014 at 11:12 AM, HQS <hqsoftwares@gmail.com> wrote:
>> 
>>> Well, I would like to ask Peter for a clarification about this formula:
>>> 
>>> x(a) + width(a)*fontSize(a) + tolerance >= x(b)
>>> 
>>> What is the difference between « width(a) » and « fontSize(a) » ? Is it
>>> not enough
>>> to know the width of the character « a » in pixels given by the font, to
>>> check this assertion ?
>>> 
>>> Thanks !
>>> 
>>> 
>>> On 7 March 2014 at 18:46, Maruan Sahyoun <sahyoun@fileaffairs.de> wrote:
>>> 
>>>> If you need further assistance, please let us know.
>>>> 
>>>> BR
>>>> Maruan Sahyoun
>>>> 
>>>> On 07.03.2014 at 18:24, HQS <hqsoftwares@gmail.com> wrote:
>>>> 
>>>>> Thank you all for those accurate answers.
>>>>> I will give the geometrical approach a try, based on the (x, y)
>>>>> coordinates of the characters.
>>>>> 
>>>>> Best regards,
>>>>> 
>>>>> Julien
>>>>> 
>>>>> On 7 March 2014 at 13:25, Peter Murray-Rust <pm286@cam.ac.uk> wrote:
>>>>> 
>>>>>> On Fri, Mar 7, 2014 at 11:16 AM, Confidential Confidential <
>>>>>> hqsoftwares@gmail.com> wrote:
>>>>>> 
>>>>>>> Sirs,
>>>>>>> 
>>>>>>> I had already thought about this graphical approach to reconstructing the
>>>>>>> words. I dropped it because I'm a bit sceptical about the reliability of
>>>>>>> such a method. I can't help thinking that it will not be 100%
>>>>>>> reliable. I understand why CAD software would produce such output,
>>>>>>> though (thank you for this new word that I didn't know, "boustrophedonic";
>>>>>>> it explains the result obtained well).
>>>>>>> 
>>>>>> 
>>>>>> It's not as bad as you think. We have reconstructed the text from hundreds
>>>>>> of scientific papers (so probably nearly a million words) and found very
>>>>>> few problems. The reason we are doing this rather than using PDFBox tools
>>>>>> is that scientific (and especially maths) PDFs contain many diacritics, high
>>>>>> Unicode points, occasional graphics strokes, variable font size and style,
>>>>>> ligatures, non-horizontal text, etc.
>>>>>> 
>>>>>> For running text it works very well - assuming that the characters announce
>>>>>> their widths. Then - roughly - "ab" is a word if
>>>>>> 
>>>>>> x(a) + width(a)*fontSize(a) + tolerance >= x(b)
>>>>>> 
>>>>>> else we can *crudely* estimate the number of intervening spaces (this is
>>>>>> very suspect as publishers may elide concatenated spaces).
>>>>>> 
>>>>>> All standard Fonts (see PDF spec) should announce their widths.
>>>>>> Unfortunately scientific publishers use some of the worst-constructed fonts
>>>>>> in the world and sometimes we have to guess - by surveying a body of
>>>>>> character positions and trying to work out spaces and font-type.
>>>>>> 
>>>>>> 
>>>>>>> Supposing that the characters appear in a totally arbitrary order,
>>>>>>> detecting that they're on the same line is more or less a piece of cake
>>>>>>> (unless I need to introduce a tolerance, which makes things more
>>>>>>> difficult),
>>>>>> 
>>>>>> 
>>>>>> In a modern PDF we find that all characters on the same line tend to have
>>>>>> equal y-coords to at least 3 decimals. The problem is that OCR'ed
>>>>>> characters may have variable y because of rounding errors and antialiasing.
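
One possible way to group characters into lines with a y tolerance, as a minimal sketch. The Glyph class, the method name and the tolerance values are hypothetical. A tolerance of 0.001 corresponds to the "equal to 3 decimals" observation above, while OCR'ed files would need a larger value chosen by inspection.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class LineGrouper {
    /** Hypothetical stand-in for a positioned character. */
    static final class Glyph {
        final String text;
        final float x, y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups glyphs given in arbitrary order into lines. A glyph joins the
     * current line if its y is within yTolerance of the first glyph of that line.
     * Lines come back in ascending y; each line is sorted by ascending x.
     */
    static List<List<Glyph>> groupIntoLines(List<Glyph> glyphs, float yTolerance) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y));

        List<List<Glyph>> lines = new ArrayList<>();
        List<Glyph> current = null;
        float lineY = 0f;
        for (Glyph g : sorted) {
            if (current == null || Math.abs(g.y - lineY) > yTolerance) {
                current = new ArrayList<>();
                lines.add(current);
                lineY = g.y; // reference y for this line
            }
            current.add(g);
        }
        for (List<Glyph> line : lines) {
            line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        }
        return lines;
    }
}
```

Comparing against the first glyph of the line keeps the sketch simple. A slowly drifting baseline, which is plausible in scanned pages, would need a running average or a clustering step instead.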
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> but grouping the characters according to their X position is
>>>>>>> not at all an easy task.
>>>>>>> 
>>>>>> 
>>>>>> The order should be fairly clear. The problems are:
>>>>>> * spaces (see above)
>>>>>> * hyphens at line-end (this requires heuristics - maybe lookup in Wordnet)
>>>>>> - we generally solve > 90%. Hyphens in chemistry are meaningful
>>>>>> * diacritics. Some characters have diacritics with the same x (e.g. E and
>>>>>> acute). These can occur in variable order. Where possible we try to
>>>>>> recreate a single Unicode point.
>>>>>> * over and underbars
>>>>>> * ligatures (in "waffle") there may be 6 characters or only 4: w-a-ffl-e. We
>>>>>> split the latter.
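
For the diacritic and ligature items in the list above, Unicode normalization from the Java standard library covers the simplest cases. This is a minimal sketch, not the method used by the posters. It assumes the accent has already been mapped to a combining mark (such as U+0301) placed after its base letter, and that the ligature arrives as a Unicode presentation form (such as U+FB04). Note that NFKC also rewrites other compatibility characters, superscript digits for example, which may be unwanted in scientific text.

```java
import java.text.Normalizer;

final class GlyphText {
    /** Recreates a single code point where one exists: "E" + U+0301 becomes U+00C9. */
    static String composeDiacritics(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    /** Splits presentation-form ligatures: U+FB04 becomes the letters "ffl". */
    static String splitLigatures(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        System.out.println(composeDiacritics("E\u0301").length()); // prints 1
        System.out.println(splitLigatures("wa\uFB04e"));           // prints waffle
    }
}
```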
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> But this is not an issue, my problem is more the fact that this method may
>>>>>>> not be 100% reliable. What do you think ?
>>>>>>> 
>>>>>> 
>>>>>> We are committed to solving it for English-language science and European
>>>>>> personal names. The worst case is probably slanted text in diagrams.
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> As for the technical part (overloading the processText), it's ok, thanks
>>>>>>> for the advice.
>>>>>>> 
>>>>>>> Best regards
>>>>>>> 
>>>>>>> Julien
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>> Peter Murray-Rust
>>>>>> Reader in Molecular Informatics
>>>>>> Unilever Centre, Dep. Of Chemistry
>>>>>> University of Cambridge
>>>>>> CB2 1EW, UK
>>>>>> +44-1223-763069
>>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> -- 
>> Peter Murray-Rust
>> Reader in Molecular Informatics
>> Unilever Centre, Dep. Of Chemistry
>> University of Cambridge
>> CB2 1EW, UK
>> +44-1223-763069
> 

