pdfbox-users mailing list archives

From Ian Holsman <kry...@gmail.com>
Subject Re: PDFBox and superscript format .NET
Date Fri, 18 May 2012 22:07:35 GMT
No idea about examples.

Look at overriding endPage() in a PDFTextStripper subclass and doing something like:

for (List<TextPosition> aCharactersByArticle : charactersByArticle) {
    for (TextPosition t : aCharactersByArticle) {
        // inspect each glyph here, e.g. t.getCharacter(), t.getX(), t.getY()
    }
}
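
Once the loop above has collected the glyphs for a line, one way to flag superscripts is to compare each glyph's vertical position and font size against the normal text on that line. The following is only a rough sketch of that idea, not PDFBox code: Glyph is a hypothetical stand-in for TextPosition holding just the three values used here, SuperscriptMarker and mark() are made-up names, and it assumes Y grows downward (a smaller Y means higher on the page) and that the largest font size on the line belongs to the normal text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for PDFBox's TextPosition: only the values this sketch needs.
final class Glyph {
    final String text;
    final float y;        // assumption: smaller y means higher on the page
    final float fontSize;

    Glyph(String text, float y, float fontSize) {
        this.text = text;
        this.y = y;
        this.fontSize = fontSize;
    }
}

final class SuperscriptMarker {
    // Wraps each run of raised, smaller glyphs in square brackets.
    static String mark(List<Glyph> line) {
        // Assumption: the largest font size on the line is the normal text,
        // and its y value is the reference baseline.
        float bodySize = 0f;
        float bodyY = 0f;
        for (Glyph g : line) {
            if (g.fontSize > bodySize) {
                bodySize = g.fontSize;
                bodyY = g.y;
            }
        }

        StringBuilder out = new StringBuilder();
        boolean inSuperscript = false;
        for (Glyph g : line) {
            boolean raised = g.fontSize < bodySize && g.y < bodyY;
            if (raised && !inSuperscript) {
                out.append('[');
            }
            if (!raised && inSuperscript) {
                out.append(']');
            }
            out.append(g.text);
            inSuperscript = raised;
        }
        if (inSuperscript) {
            out.append(']');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = new ArrayList<>();
        line.add(new Glyph("1", 100f, 12f));
        line.add(new Glyph("0", 100f, 12f));
        line.add(new Glyph("4", 95f, 7f));  // smaller and higher: treated as superscript
        System.out.println(mark(line));     // prints 10[4]
    }
}
```

In a real stripper the Glyph values would come from each TextPosition in the loop. The strict comparisons are a simplification, so some tolerance on the Y difference would probably be needed for real documents.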

On May 19, 2012, at 3:54 AM, Hawkins, Thomas A. - Student wrote:

> Any idea where I might go for some examples of the TextPosition class? I've searched
> the docs and found nothing. Looking over the old threads, I've only found people with issues
> regarding TextPosition. This sounds perfect for what I need; I just need to figure out
> how to use it (i.e. get the x,y positions and iterate through them).
> 
> Thank you.
> ________________________________________
> From: Ian Holsman [kryton@gmail.com]
> Sent: Friday, May 18, 2012 3:46 AM
> To: users@pdfbox.apache.org
> Cc: users@pdfbox.apache.org
> Subject: Re: PDFBox and superscript format .NET
> 
> You might want to look at the processOperator function and watch for the Tj and Ts operators.
> Ts is the text rise operator used for super/subscript, which might give you the information you need.
> If you track the TextPosition class, it should give you the x,y position of the lettering.
> Sadly it's harder than it sounds :(
> (I'm a newbie so I might be completely off base)
> 
> Sent from my iPhone
> 
> On 18/05/2012, at 3:37 PM, "Hawkins, Thomas A. - Student" <thawkins@midway.edu> wrote:
> 
>> As an addendum: I didn't realize when I sent this out that the numbers are a combination
>> of regular and superscript digits. Since email won't render superscript, I'll use mathematical
>> notation instead. The numbers should be:
>> 8^5       (INSTEAD OF 85)
>> 9^6       (INSTEAD OF 96)
>> 4^7       (INSTEAD OF 47)
>> 10^4     (INSTEAD OF 104)
>> ________________________________________
>> From: Hawkins, Thomas A. - Student [thawkins@midway.edu]
>> Sent: Friday, May 18, 2012 1:21 AM
>> To: users@pdfbox.apache.org
>> Subject: PDFBox and superscript format .NET
>> 
>> I am using the .NET version of PDFBox and I have a PDF that contains data such as this:
>> 
>> Name                  Location
>> Jim Daviees           85
>> Herschel Walker       96
>> Vince Gogh            47
>> Andrew Lincoln        104
>> 
>> I need both the name value and the location value. When I use the following code:
>> 
>>   Dim p As PDDocument = PDDocument.load(fi.FullName)
>>   Dim r As PDFTextStripper = New PDFTextStripper
>>
>>   Dim stringVal As String = r.getText(p)
>>   Dim bytes As Byte() = System.Text.Encoding.ASCII.GetBytes(stringVal)
>> 
>> I get the following in the .txt file (and also in HTML when I've converted it to that):
>> Jim Daviees
>> Herschel Walker
>> Vince Gogh
>> Andrew Lincoln
>> 85
>> 96
>> 47
>> 104
>> 
>> I'm okay with the layout, as I've got a workaround for that. My problem is that
>> it destroys any mention of the superscript exponents. Is there a way that I can locate these
>> superscript parts and encapsulate them in brackets or something, so that the returned value is
>> more like this:
>> Jim Daviees
>> Herschel Walker
>> Vince Gogh
>> Andrew Lincoln
>> 8[5]
>> 9[6]
>> 4[7]
>> 10[4]
>> 
>> So, nutshell time. Can I use PDFBox (.NET version) to locate the instances of superscript
>> in a PDF file (like locating <sup></sup> in HTML) and swap them out for an easily
>> recognized symbol to be output to my destination file? I picked brackets because I have no
>> brackets in my source file whatsoever and they would be very easy for me to code around. Thanks
>> in advance.

