pdfbox-users mailing list archives

From Ian Holsman <kry...@gmail.com>
Subject Re: PDFBox and superscript format .NET
Date Fri, 18 May 2012 07:46:38 GMT
You might want to look at the processOperator function and watch for the Tj and Ts operators.
Ts is the text rise operator, which is what produces super/subscript and might give you the
information you need. If you track the TextPosition class, it should give you the x,y position
of the lettering.
Sadly it's harder than it sounds :(
(I'm a newbie so I might be completely off base)
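
To make the idea concrete, here is a minimal Java sketch of the bracket-marking step only. It assumes you have already collected, for each glyph on a line, its text, baseline y, and font size (for example from the TextPosition objects the stripper hands you). The Glyph and SuperscriptMarker classes, the 0.2 threshold, and the "y grows upward" convention are all assumptions made for illustration; none of them are part of PDFBox.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the per-glyph values read from the PDF.
// This is NOT a PDFBox class.
final class Glyph {
    final String text;
    final float baselineY; // assumption: larger y means higher on the page
    final float fontSize;

    Glyph(String text, float baselineY, float fontSize) {
        this.text = text;
        this.baselineY = baselineY;
        this.fontSize = fontSize;
    }
}

final class SuperscriptMarker {
    // Assumed threshold: a glyph counts as raised when its baseline sits
    // more than 20% of the body font size above the line's baseline.
    private static final float RISE_RATIO = 0.2f;

    // Returns the line's text with each run of raised glyphs wrapped in [ ].
    static String mark(List<Glyph> line) {
        if (line.isEmpty()) {
            return "";
        }
        // Use the lowest baseline and the largest font size on the line as
        // the reference for "normal" text. This ignores subscripts.
        float base = Float.MAX_VALUE;
        float bodySize = 0f;
        for (Glyph g : line) {
            base = Math.min(base, g.baselineY);
            bodySize = Math.max(bodySize, g.fontSize);
        }
        StringBuilder out = new StringBuilder();
        boolean inSup = false;
        for (Glyph g : line) {
            boolean raised = (g.baselineY - base) > RISE_RATIO * bodySize;
            if (raised && !inSup) {
                out.append('['); // a superscript run starts here
            }
            if (!raised && inSup) {
                out.append(']'); // the superscript run just ended
            }
            out.append(g.text);
            inSup = raised;
        }
        if (inSup) {
            out.append(']'); // close a run that reaches the end of the line
        }
        return out.toString();
    }
}
```

With an "8" at baseline 100 and size 12 followed by a "5" at baseline 105 and size 8, SuperscriptMarker.mark(Arrays.asList(...)) returns 8[5]. If the y values you get grow downward instead of upward, the comparison has to be flipped, and a line that also contains subscripts would need a better baseline estimate than the minimum.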

Sent from my iPhone

On 18/05/2012, at 3:37 PM, "Hawkins, Thomas A. - Student" <thawkins@midway.edu> wrote:

> As an addendum, something I didn't realize when I sent this out: the numbers are a
> combination of regular and superscript digits. Since email won't support superscript,
> I'll use mathematical notation instead. The numbers should be
> 8^5       (INSTEAD OF 85)
> 9^6       (INSTEAD OF 96)
> 4^7       (INSTEAD OF 47)
> 10^4     (INSTEAD OF 104)
> ________________________________________
> From: Hawkins, Thomas A. - Student [thawkins@midway.edu]
> Sent: Friday, May 18, 2012 1:21 AM
> To: users@pdfbox.apache.org
> Subject: PDFBox and superscript format .NET
> I am using the .NET version of PDFBox and I have a pdf that contains data such as this:
> Name                  Location
> Jim Daviees           85
> Herschel Walker       96
> Vince Gogh            47
> Andrew Lincoln        104
> I need both the name value and the location value. When I use the following code:
>     Dim p As PDDocument = PDDocument.load(fi.FullName)
>     Dim r As PDFTextStripper = New PDFTextStripper
>     Dim stringVal As String = r.getText(p)
>     Dim bytes As Byte() = System.Text.Encoding.ASCII.GetBytes(stringVal)
> I get the following in the .txt file (and also in HTML when I convert it to that):
> Jim Daviees
> Herschel Walker
> Vince Gogh
> Andrew Lincoln
> 85
> 96
> 47
> 104
> I'm okay with the layout, as I've got a workaround for that. My problem is that it destroys
> any mention of the superscript exponents. Is there a way that I can locate these superscript
> parts and encapsulate them in brackets or something, so that the returned value is more like
> Jim Daviees
> Herschel Walker
> Vince Gogh
> Andrew Lincoln
> 8[5]
> 9[6]
> 4[7]
> 10[4]
> So, nutshell time: can I use PDFBox (.NET version) to locate the instances of superscript
> in a PDF file (like locating <sup></sup> in HTML) and swap each one for an easily
> recognized symbol in the output written to my destination file? I picked brackets because
> I have no brackets in my source file whatsoever, and they would be very easy for me to
> code around. Thanks in advance.
