pdfbox-users mailing list archives

From Olaf Drümmer <olafl...@callassoftware.com>
Subject Re: Eliminating super scripts while extracting text from pdf
Date Fri, 28 Mar 2014 21:47:02 GMT
Two thoughts:

- Keep track of the baseline and size of each character. If the baseline is slightly shifted
(upward -> superscript, downward -> subscript) and the size is smaller than that of the
surrounding characters, it is possibly a superscript or subscript character.

- Be aware that some fonts contain dedicated glyphs for superscripts. In that case the baseline
and text size are the same as for the surrounding text, so you would have to check the Unicode
code point to tell whether you have encountered a superscript.
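
A rough sketch of how the two checks could be combined, in plain Java. This is not PDFBox API: the Glyph class, the thresholds (0.9 and 0.1) and the assumption that y grows upward are all made up for illustration, and you would have to fill Glyph from whatever position data your extraction gives you.

```java
import java.util.ArrayList;
import java.util.List;

final class ScriptFilter {

    /** Hypothetical glyph record (not a PDFBox type). */
    static final class Glyph {
        final String text;
        final float baselineY; // assumption: y grows upward, as in PDF user space
        final float fontSize;  // effective text size in points

        Glyph(String text, float baselineY, float fontSize) {
            this.text = text;
            this.baselineY = baselineY;
            this.fontSize = fontSize;
        }
    }

    enum Kind { NORMAL, SUPERSCRIPT, SUBSCRIPT }

    // Assumed thresholds; tune them against real documents.
    static final float SIZE_RATIO = 0.9f;  // "smaller" means below 90% of the reference size
    static final float SHIFT_RATIO = 0.1f; // "shifted" means more than 10% of the reference size

    static Kind classify(Glyph g, Glyph reference) {
        if (g.text.isEmpty()) {
            return Kind.NORMAL;
        }
        // Check 2: dedicated Unicode superscript/subscript code points.
        int cp = g.text.codePointAt(0);
        if (cp == 0x00B9 || cp == 0x00B2 || cp == 0x00B3 || (cp >= 0x2070 && cp <= 0x207F)) {
            return Kind.SUPERSCRIPT;
        }
        if (cp >= 0x2080 && cp <= 0x209C) {
            return Kind.SUBSCRIPT;
        }
        // Check 1: smaller size combined with a shifted baseline.
        boolean smaller = g.fontSize < reference.fontSize * SIZE_RATIO;
        float shift = g.baselineY - reference.baselineY;
        float minShift = reference.fontSize * SHIFT_RATIO;
        if (smaller && shift > minShift) {
            return Kind.SUPERSCRIPT;
        }
        if (smaller && shift < -minShift) {
            return Kind.SUBSCRIPT;
        }
        return Kind.NORMAL;
    }

    /** Returns the text of one line with suspected super/subscripts dropped. */
    static String stripScripts(List<Glyph> line) {
        StringBuilder out = new StringBuilder();
        Glyph reference = null; // last glyph classified as normal text
        for (Glyph g : line) {
            Kind kind = classify(g, reference == null ? g : reference);
            if (kind == Kind.NORMAL) {
                out.append(g.text);
                reference = g;
            }
        }
        return out.toString();
    }
}
```

Each glyph is compared with the last glyph that was accepted as normal text, so a run of several superscript characters is dropped as a whole. As said above, this only flags "possibly" superscript characters; small caps or footnote text on a different line would need extra care.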

Olaf

On 28 Mar 2014, at 19:23, Siva Kumar Ch <sivakumarch51@gmail.com> wrote:

> Hi,
> 
> I am trying to extract text from a PDF and process the text. The extraction
> itself works, but I could not get much benefit out of it because the
> extracted text treats superscripts, usually numbers, as normal text.
> 
> A superscript attached to the last word of a sentence is placed after the
> period (.) in the extracted text.
> 
> Example: the word "test" with the superscript "super", when it appears at
> the end of a sentence, is extracted as "test.super".
> 
> Is there any way I can get rid of superscripts?
> 
> -- 
> Br,
> Siva.

