pdfbox-users mailing list archives

From Ernesto De Santis <desantiserne...@yahoo.com.ar>
Subject PDF file created with LaTeX is bad parsed
Date Sat, 12 Dec 2009 21:04:22 GMT
I worked on the issue I created some time ago, 'PDF file created with LaTeX is bad parsed'. I believe
it's a bad font encoding detection, or an unsupported encoding.
https://issues.apache.org/jira/browse/PDFBOX-534

The problem description:
-------------------------------------------------------
I'm getting unexpected behavior when parsing a PDF file.

I'm trying to get the clean body text of a file, and I get a lot of
aXX strings, where each X is a digit. The number appears to be the character
code of the real character, but I'm not sure.

My code is very simple:

   String[] args = {"/home/ernesto/tesis/documento/kvfs.pdf"};
   ExtractText.main(args);


I used PDFBox version 0.8.0-incubator, built on 12/12/2009.

The output I get is:
a73a109a112a108a101a109a101a110a116a97a110a100a111 a97a99a99a101a115a111 a97 a115a105a115a116a101a109a97a115
a100a101
a97a114a99a104a105a118a111a115 a118a105a114a116a117a97a108a101a115
a112a97a114a97 a108a97 a104a101a114a114a97a109a105a101a110a116a97
a100a101 a98a250a115a113a117a101a100a97 a75a110a101a111a98a97a115a101
and more ......
-----------------------------------------------------------------
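
To make the output above readable while the real problem is unsolved, here is a minimal sketch that turns the aXX tokens back into characters. It assumes each token is the letter 'a' followed by the decimal character code, and that the code can be read as a Latin-1/Unicode code point (this matches the sample, where a250 gives 'ú', but it is only an assumption). The class name AxxDecoder is hypothetical and is not part of PDFBox.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
final class AxxDecoder {
    // 'a' followed by one to three decimal digits, e.g. a73 or a250.
    private static final Pattern TOKEN = Pattern.compile("a(\\d{1,3})");

    // Replaces every aXX token with the character whose code is XX.
    // Assumption: XX is a Latin-1/Unicode code point.
    static String decode(String text) {
        Matcher m = TOKEN.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());               // text between tokens
            out.append((char) Integer.parseInt(m.group(1))); // decoded character
            last = m.end();
        }
        out.append(text, last, text.length());               // trailing text
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "a73a109a112a108a101a109a101a110a116a97a110a100a111 "
                + "a97a99a99a101a115a111";
        System.out.println(decode(sample)); // prints: Implementando acceso
    }
}
```

This only post-processes the extracted text. It does not fix the encoding detection, and it would also rewrite any genuine text that happens to look like a token (for example a literal "a4").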

Now I have debugged it and tested some alternatives.


I found the cause of the problem, but not the solution.

It's a bad font encoding detection, or an unsupported encoding.

Debugging the PDFBox classes, I found the lines that encode the characters, where the character
is read wrongly. Look at these lines:
Class PDFont, Method String encode( byte[] c, int offset, int length ), line 438.

438            Encoding encoding = getEncoding();
439            if( encoding != null)
440            {
441                retval = encoding.getCharacter( getCodeFromArray( c, offset, length ) );
442            }
443            if( retval == null )
444            {
445                retval = getStringFromArray( c, offset, length );
446            }

On the first line, getEncoding() returns an org.apache.pdfbox.encoding.DictionaryEncoding,
so execution enters the if on line 439, and getCharacter returns an aXX string. Because retval
is no longer null, the second if (line 443) is skipped, but I evaluated getStringFromArray by
hand and it returns a perfectly normal character like 'i'.
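
The control flow described above can be sketched without PDFBox. This is only an illustration under stated assumptions: the DIFFERENCES map is a hypothetical stand-in for what encoding.getCharacter returns, the cast to char stands in for getStringFromArray, and the guard in encodeWithGuard is an untested idea, not the PDFBox implementation or a confirmed fix.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of lines 438-446; none of these names exist in PDFBox.
final class EncodeFallbackSketch {
    // Stand-in for the encoding lookup: character code -> returned string.
    static final Map<Integer, String> DIFFERENCES = new HashMap<>();
    static {
        DIFFERENCES.put(105, "a105"); // what the lookup gives for 'i'
    }

    // Same shape as the original code: fall back only when the lookup is null.
    static String encodeAsInSource(int code) {
        String retval = DIFFERENCES.get(code);
        if (retval == null) {
            retval = String.valueOf((char) code);
        }
        return retval;
    }

    // True for strings shaped like a105: 'a' followed only by digits.
    static boolean looksLikePlaceholder(String name) {
        return name.matches("a\\d+");
    }

    // Variant that also falls back when the lookup gives an aXX string.
    static String encodeWithGuard(int code) {
        String retval = DIFFERENCES.get(code);
        if (retval == null || looksLikePlaceholder(retval)) {
            retval = String.valueOf((char) code);
        }
        return retval;
    }

    public static void main(String[] args) {
        System.out.println(encodeAsInSource(105)); // prints: a105
        System.out.println(encodeWithGuard(105));  // prints: i
    }
}
```

The sketch shows why the fallback on line 443 never runs: a non-null aXX string counts as a successful lookup.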

Then I tried two approaches. The first was to understand what is wrong with my font encoding and
what is generating it. My PDF is generated by LaTeX, and I found that the package
\usepackage[T1]{fontenc} is used for European accented characters; I'm using it. I removed this
line from my LaTeX source file and generated the PDF again. When I ran the PDFBox text extraction
again, I got a better result:

Implementando acceso a sistemas de
archivos virtuales para la herramienta
de b usqueda Kneobase
Alumno: Ernesto De Santis
Director: Pablo Ernesto Mart  nez L opez

But WITHOUT the accented characters. 

Then I tried the other way: using getStringFromArray instead of encoding.getCharacter in the PDFBox
source, after restoring the LaTeX source to the original one. The result was similar, with the
accented characters still wrong:

Implementando acceso a sistemas de
archivos virtuales para la herramienta
de b?squeda Kneobase
Alumno: Ernesto De Santis
Director: Pablo Ernesto Mart?nez L?pez 

-- 
Blog about our lives in Rio de Janeiro (Fernanda and Ernesto):
http://www.fernandayernesto.blogspot.com/


