Date: Fri, 28 Mar 2014 14:23:16 -0400
Subject: Eliminating super scripts while extracting text from pdf
From: Siva Kumar Ch <sivakumarch51@gmail.com>
To: users@pdfbox.apache.org

Hi,

I am trying to extract text from a PDF and then process it. The extraction
itself works, but the result is of limited use because superscripts, usually
numbers, come out as normal text.

When a superscript is attached to the last word of a sentence, it is
placed after the period (.).

Example: the word "test" with the superscript "super".
When it appears at the end of a sentence, it is extracted as
"test.super".

Is there any way I can get rid of superscripts?
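
To make the goal concrete, here is a rough sketch of the kind of filtering I
have in mind. It is plain Java and does not call PDFBox at all: the Glyph
class, the 0.8 and 0.1 thresholds, and the rule "smaller font and raised
baseline means superscript" are all assumptions made for illustration. I
would expect the font size and Y position to come from the text positions
that PDFBox reports during extraction, but I have not wired that up.

```java
import java.util.ArrayList;
import java.util.List;

class SuperscriptFilter {

    /** Hypothetical stand-in for one extracted text chunk; not a PDFBox type. */
    static final class Glyph {
        final String text;
        final float fontSize;  // font size in points
        final float baseline;  // Y of the baseline, growing downwards on the page

        Glyph(String text, float fontSize, float baseline) {
            this.text = text;
            this.fontSize = fontSize;
            this.baseline = baseline;
        }
    }

    /**
     * Returns the text of one line with suspected superscripts removed.
     * Assumption: the largest font size on the line is the body text, and a
     * superscript is both noticeably smaller and raised above that baseline.
     */
    static String stripSuperscripts(List<Glyph> line) {
        float bodySize = 0f;
        float bodyBaseline = 0f;
        for (Glyph g : line) {
            if (g.fontSize > bodySize) {
                bodySize = g.fontSize;
                bodyBaseline = g.baseline;
            }
        }

        StringBuilder out = new StringBuilder();
        for (Glyph g : line) {
            boolean smaller = g.fontSize < 0.8f * bodySize;               // assumed threshold
            boolean raised = g.baseline < bodyBaseline - 0.1f * bodySize; // assumed threshold
            if (!(smaller && raised)) {
                out.append(g.text);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = new ArrayList<>();
        line.add(new Glyph("test", 12f, 100f));
        line.add(new Glyph(".", 12f, 100f));
        line.add(new Glyph("super", 7f, 95f)); // smaller and raised
        System.out.println(stripSuperscripts(line)); // prints "test."
    }
}
```

With something like this, "test.super" from the example above would become
"test.", while small text that sits on or below the baseline (a subscript,
for instance) would be kept.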

-- 
Br,
Siva.
