pdfbox-users mailing list archives

From "Hawkins, Thomas A. - Student" <thawk...@midway.edu>
Subject RE: PDFBox and superscript format .NET
Date Fri, 18 May 2012 05:37:51 GMT
As an addendum, something I didn't realize when I sent this out: the numbers are a combination
of regular and superscript digits. Since email won't support superscript, I'll use mathematical
operators instead. The numbers should be:
8^5       (INSTEAD OF 85)
9^6       (INSTEAD OF 96)
4^7       (INSTEAD OF 47)
10^4     (INSTEAD OF 104)
________________________________________
From: Hawkins, Thomas A. - Student [thawkins@midway.edu]
Sent: Friday, May 18, 2012 1:21 AM
To: users@pdfbox.apache.org
Subject: PDFBox and superscript format .NET

I am using the .NET version of PDFBox, and I have a PDF that contains data such as this:

Name                  Location
Jim Daviees           85
Herschel Walker       96
Vince Gogh            47
Andrew Lincoln        104

I need both the name value and the location value. When I use the following code:

    Dim p As PDDocument = PDDocument.load(fi.FullName)
    Dim r As PDFTextStripper = New PDFTextStripper

    Dim stringVal As String = r.getText(p)
    Dim bytes As Byte() = System.Text.Encoding.ASCII.GetBytes(stringVal)

I get the following in the .txt file (and also in HTML when I convert it to that):
Jim Daviees
Herschel Walker
Vince Gogh
Andrew Lincoln
85
96
47
104

I'm okay with the layout, as I've got a workaround for that. My problem is that it destroys
any mention of the superscript exponents. Is there a way that I can locate these superscript
parts and encapsulate them in brackets or something, so that the returned value is more like
this:
Jim Daviees
Herschel Walker
Vince Gogh
Andrew Lincoln
8[5]
9[6]
4[7]
10[4]

So, nutshell time: can I use PDFBox (.NET version) to locate the instances of superscript
in a PDF file (like locating <sup></sup> in HTML) and swap them out for an easily
recognized symbol in the output to my destination file? I picked brackets because I have no
brackets in my source file whatsoever, and they would be very easy for me to code around. Thanks
in advance.
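
To make the requested output concrete, here is a minimal sketch in Java of the bracket-marking step only. It assumes that each extracted character comes with its font size and baseline height, which is an assumption about the extraction step and not something shown in the code above. The Glyph and SuperscriptMarker classes and the 0.85 size threshold are hypothetical names and values chosen for illustration; they are not part of PDFBox. A character is treated as superscript when it is noticeably smaller than the largest character on the line and sits above that character's baseline.

```java
import java.util.List;

// Hypothetical stand-in for per-character data from a PDF text extractor.
// baselineY is assumed to be measured upward from the bottom of the page.
final class Glyph {
    final String text;
    final double baselineY;
    final double fontSize;

    Glyph(String text, double baselineY, double fontSize) {
        this.text = text;
        this.baselineY = baselineY;
        this.fontSize = fontSize;
    }
}

final class SuperscriptMarker {
    // Assumed threshold: superscripts are set below 85% of the body size.
    private static final double SIZE_RATIO = 0.85;

    // A glyph counts as superscript when it is smaller than the body text
    // and its baseline is raised above the body baseline.
    static boolean isSuperscript(Glyph g, Glyph body) {
        return g.fontSize < SIZE_RATIO * body.fontSize
                && g.baselineY > body.baselineY;
    }

    // Returns the text of one line with each superscript run wrapped in [ ].
    static String mark(List<Glyph> line) {
        if (line.isEmpty()) {
            return "";
        }
        // Use the largest glyph on the line as the body-text reference.
        Glyph body = line.get(0);
        for (Glyph g : line) {
            if (g.fontSize > body.fontSize) {
                body = g;
            }
        }
        StringBuilder out = new StringBuilder();
        boolean inSup = false;
        for (Glyph g : line) {
            boolean sup = isSuperscript(g, body);
            if (sup && !inSup) {
                out.append('[');   // a superscript run starts here
            }
            if (!sup && inSup) {
                out.append(']');   // the superscript run just ended
            }
            out.append(g.text);
            inSup = sup;
        }
        if (inSup) {
            out.append(']');       // close a run that reaches the end of the line
        }
        return out.toString();
    }
}
```

With these assumptions, a 12 pt "8" followed by a raised 8 pt "5" would come out as 8[5], while two 12 pt digits on the same baseline would be left as plain 85. If the extractor reports y coordinates measured downward from the top of the page, the baseline comparison would need to be flipped.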
