pdfbox-users mailing list archives

From Siva Kumar Ch <sivakumarc...@gmail.com>
Subject Eliminating super scripts while extracting text from pdf
Date Fri, 28 Mar 2014 18:23:16 GMT

I am trying to extract text from a PDF and then process that text. The
extraction itself works, but the result is of limited use because
superscripts, usually numbers, are extracted as normal text.

When a superscript is attached to the last word of a sentence, it is
placed after the period (.).

Example: the word "test" with the superscript "super".
When it appears at the end of a sentence, it is extracted as -

Is there any way I can get rid of superscripts?
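
A rough sketch of one possible approach, not a confirmed PDFBox recipe: if
each extracted glyph is available together with its font size and baseline
position (PDFBox hands TextPosition objects to PDFTextStripper subclasses,
which carry this kind of information), glyphs that are both noticeably
smaller and raised relative to the body text on the same line could be
dropped. The Glyph class, the stripSuperscripts method, and the 0.8 and 0.1
thresholds below are all hypothetical and only illustrate the heuristic.

```java
import java.util.ArrayList;
import java.util.List;

public class SuperscriptFilter {

    /** Hypothetical stand-in for one extracted glyph run; not a PDFBox class. */
    public static final class Glyph {
        final String text;
        final float fontSize;   // font size in points
        final float baselineY;  // measured from the top of the page: smaller = higher

        public Glyph(String text, float fontSize, float baselineY) {
            this.text = text;
            this.fontSize = fontSize;
            this.baselineY = baselineY;
        }
    }

    /**
     * Returns the text of one line with suspected superscripts removed.
     * Assumption: the largest font size on the line is the body text size,
     * and the first glyph with that size sits on the body baseline.
     */
    public static String stripSuperscripts(List<Glyph> line) {
        float bodySize = 0f;
        float bodyBaseline = 0f;
        for (Glyph g : line) {
            if (g.fontSize > bodySize) {
                bodySize = g.fontSize;
                bodyBaseline = g.baselineY;
            }
        }

        StringBuilder out = new StringBuilder();
        for (Glyph g : line) {
            // Assumed thresholds: under 80% of body size and lifted by
            // more than 10% of the body size above the body baseline.
            boolean smaller = g.fontSize < 0.8f * bodySize;
            boolean raised = g.baselineY < bodyBaseline - 0.1f * bodySize;
            if (!(smaller && raised)) {
                out.append(g.text);
            }
        }
        return out.toString();
    }
}
```

The thresholds would need tuning per document. Small text that is not raised
(for example a subscript) is kept by this check.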

