Date: Fri, 28 Mar 2014 14:23:16 -0400
Subject: Eliminating super scripts while extracting text from pdf
From: Siva Kumar Ch
To: users@pdfbox.apache.org

Hi,

I am trying to extract text from a PDF and then process it. The extraction itself works, but the output is of limited use because superscripts, usually numbers, are extracted as normal text. When a superscript is attached to the last word of a sentence, it is placed after the period (.).

Example: the word "test" with the superscript "super", appearing at the end of a sentence, is extracted as "test.super".

Is there any way I can get rid of superscripts?

--
Br,
Siva.