pdfbox-users mailing list archives

From Andreas Lehmkuehler <andr...@lehmi.de>
Subject Re: PDFBox View Post-Read, Pre-Conversion Stream
Date Sun, 20 May 2012 11:02:34 GMT
Hi,

On 20.05.2012 01:31, Hawkins, Thomas A. - Student wrote:
> I've asked this question a couple of times and I really need help - no one has really
 > given me any type of answer that I can use. I've had answers but they
 > point me in no positive direction.
No offense, but you have to be more patient; we are all volunteers ...

> I am converting pdf files to txt files (of course I lose the formatting),
 > but I get horrible results converting to html and even worse to XML.
>
> So what I want to do, is have the program either place a space between
 > superscript exponents, or, place exponents in brackets.
>
> Is there any way for me to access the stream of data after the pdf is read,
 > but before it is converted to a string? If I can find a way to do this
 > then I can figure out how to edit the data to return the txt file I want.
It is not that easy.

- the information you are looking for is part of the so-called content stream
- that stream is processed in PDFStreamEngine#processStream [1]
- the main text processing is done in PDFStreamEngine#processEncodedText
- the PDF operator -> OperatorProcessor mapping can be found here [2]
- the class TextPosition doesn't have any information about text features like
superscript
- you might have a look at the PDF specs [3]
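
Since TextPosition carries no superscript flag, one option is to guess it from geometry: a glyph that is noticeably smaller and sits higher than the preceding normal glyph is probably an exponent. The following is only a rough sketch of that heuristic and deliberately does not use PDFBox itself. The class SuperscriptMarker, the Glyph type and both thresholds are made up for illustration. In a real program the values would have to come from the text positions PDFBox hands you (for example TextPosition#getY and TextPosition#getFontSizeInPt), most likely inside a PDFTextStripper subclass. The sketch assumes the baseline value grows upward on the page; if your coordinates grow downward, flip the sign of the rise check.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, NOT part of PDFBox: wraps raised, smaller glyphs in brackets.
class SuperscriptMarker {

    // Minimal stand-in for the per-glyph data a text extractor collects.
    static final class Glyph {
        final String text;    // decoded character(s)
        final float baseline; // baseline height; ASSUMED to grow upward on the page
        final float fontSize; // font size in points

        Glyph(String text, float baseline, float fontSize) {
            this.text = text;
            this.baseline = baseline;
            this.fontSize = fontSize;
        }
    }

    // Heuristic thresholds (assumptions; they will need tuning per document).
    static final float MAX_SIZE_RATIO = 0.85f;  // superscript is at most 85% of base size
    static final float MIN_RISE_FACTOR = 0.2f;  // and raised by at least 20% of base size

    // Returns the line as text with every run of superscript glyphs in [ ].
    static String mark(List<Glyph> line) {
        StringBuilder out = new StringBuilder();
        Glyph base = null;    // last glyph that was treated as normal text
        boolean open = false; // true while inside a bracketed run
        for (Glyph g : line) {
            boolean sup = base != null
                    && g.fontSize <= base.fontSize * MAX_SIZE_RATIO
                    && g.baseline - base.baseline >= base.fontSize * MIN_RISE_FACTOR;
            if (sup && !open) {
                out.append('[');
                open = true;
            } else if (!sup && open) {
                out.append(']');
                open = false;
            }
            out.append(g.text);
            if (!sup) {
                base = g; // only normal glyphs update the reference baseline
            }
        }
        if (open) {
            out.append(']'); // close a run that reaches the end of the line
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = Arrays.asList(
                new Glyph("x", 100f, 12f),
                new Glyph("2", 104f, 8f),  // smaller and raised
                new Glyph("+", 100f, 12f),
                new Glyph("y", 100f, 12f));
        System.out.println(mark(line)); // prints x[2]+y
    }
}
```

A heuristic like this will misfire on footnote marks, small caps or lines that start with a raised glyph, so treat it as a starting point rather than a solution.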


> I am using the .NET port of pdfBox and I would appreciate some
 > examples (preferably VB or C#) but Java was my first language and
 > I'm sure I can knock the dust off of my knowledge.
As it is complicated enough to implement this stuff in Java, I doubt
you will find any ready-made examples in VB or C#.

BR
Andreas Lehmkühler

[1] http://svn.apache.org/repos/asf/pdfbox/trunk/pdfbox/src/main/java/org/apache/pdfbox/util/PDFStreamEngine.java
[2] http://svn.apache.org/repos/asf/pdfbox/trunk/pdfbox/src/main/resources/org/apache/pdfbox/resources/PDFTextStripper.properties
[3] http://www.adobe.com/de/devnet/pdf/pdf_reference.html
