Subject: Re: Eliminating super scripts while extracting text from pdf
From: Olaf Drümmer
Date: Fri, 28 Mar 2014 22:47:02 +0100
To: users@pdfbox.apache.org

Two thoughts:

- Keep track of the baseline and size of the characters. If the baseline is slightly shifted (upward -> superscript, downward -> subscript) and the size is smaller than that of the surrounding characters, it is possibly a superscript or subscript character.

- Be aware that some fonts contain dedicated glyphs for superscripts. In that case the baseline and text size are the same as for the surrounding text, so you would have to look up via the Unicode code point whether you have encountered a superscript.

Olaf

On 28 Mar 2014, at 19:23, Siva Kumar Ch wrote:

> Hi,
>
> I am trying to extract text from a PDF and process the text. I have been
> successful with the extraction, but could not get much benefit out of it, as
> the extracted text treats the superscripts, usually numbers, as normal text.
>
> A superscript on a word that is the last word of a sentence is placed
> after the period (.).
>
> Example: the word "test" with the superscript "super".
> When it appears at the end of a sentence, it is extracted as
> "test.super"
>
> Is there any way I can get rid of superscripts?
>
> --
> Br,
> Siva.
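
The two heuristics from the reply (baseline/size tracking and the Unicode code point lookup) could be sketched roughly as below. This is only a minimal illustration and not PDFBox code: the Glyph class, the method names, and the thresholds (0.9 and 0.1) are made up for the example, and it assumes that a larger y value means "higher on the page". In PDFBox the per-character data would most likely come from the TextPosition objects handed to a PDFTextStripper subclass, where the y direction may be flipped depending on which accessor is used.

```java
import java.util.List;

final class SuperscriptSketch {

    enum Kind { NORMAL, SUPERSCRIPT, SUBSCRIPT }

    /** Hypothetical holder for what an extractor reports per character. */
    static final class Glyph {
        final String text;
        final double baselineY; // assumption: larger y = higher on the page
        final double fontSize;  // in points

        Glyph(String text, double baselineY, double fontSize) {
            this.text = text;
            this.baselineY = baselineY;
            this.fontSize = fontSize;
        }
    }

    /** Heuristic 1: smaller than the surrounding text AND shifted off the baseline. */
    static Kind classifyByGeometry(Glyph g, double lineBaselineY, double lineFontSize) {
        boolean smaller = g.fontSize < 0.9 * lineFontSize; // threshold is a guess
        double shift = g.baselineY - lineBaselineY;
        double minShift = 0.1 * lineFontSize;              // threshold is a guess
        if (smaller && shift > minShift) {
            return Kind.SUPERSCRIPT;
        }
        if (smaller && shift < -minShift) {
            return Kind.SUBSCRIPT;
        }
        return Kind.NORMAL;
    }

    /** Heuristic 2: dedicated superscript/subscript code points (not exhaustive). */
    static Kind classifyByCodePoint(int cp) {
        // Latin-1 superscript digits: U+00B2, U+00B3, U+00B9
        if (cp == 0x00B2 || cp == 0x00B3 || cp == 0x00B9) {
            return Kind.SUPERSCRIPT;
        }
        // "Superscripts and Subscripts" block: U+2070..U+207F are superscripts
        if (cp >= 0x2070 && cp <= 0x207F) {
            return Kind.SUPERSCRIPT;
        }
        // ... and U+2080..U+209C are subscripts
        if (cp >= 0x2080 && cp <= 0x209C) {
            return Kind.SUBSCRIPT;
        }
        return Kind.NORMAL;
    }

    /** Drops every glyph that either heuristic flags as a superscript. */
    static String stripSuperscripts(List<Glyph> line, double lineBaselineY, double lineFontSize) {
        StringBuilder out = new StringBuilder();
        for (Glyph g : line) {
            boolean sup = classifyByGeometry(g, lineBaselineY, lineFontSize) == Kind.SUPERSCRIPT
                    || (!g.text.isEmpty()
                        && classifyByCodePoint(g.text.codePointAt(0)) == Kind.SUPERSCRIPT);
            if (!sup) {
                out.append(g.text);
            }
        }
        return out.toString();
    }
}
```

For the example from the original question (the glyphs of "test" and "." at the line's baseline and size, and the glyphs of "super" raised and smaller), stripSuperscripts returns "test." regardless of whether the superscript glyphs arrive before or after the period. Real documents will need tuned thresholds and a per-line estimate of the dominant baseline and font size.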