pdfbox-users mailing list archives

From Maruan Sahyoun <sahy...@fileaffairs.de>
Subject Re: WARNING: Did not found XRef object at specified startxref position
Date Wed, 20 Nov 2013 18:39:56 GMT
Hi,

PDFBox targets ISO-32000.

BR

Maruan

> On 20.11.2013 at 19:29, Rodrigo Caniçali <rodrigo.canicali@yahoo.com.br> wrote:
> 
> Thomas,
> 
> I found several PDF specifications on the net.
> 
> Which PDF specification does the PDFBox library follow?
> 
> Thanks,
> 
> Rodrigo
> 
> 
> 
> On Thursday, 14 November 2013 11:30, Rodrigo Caniçali <rodrigo.canicali@yahoo.com.br> wrote:
> 
> Hi Thomas,
> 
> There is no such object anywhere in the document. Searching for the keyword
> "/XRef" or "80 0", the editor finds neither. However, at the end of the
> document I found the following:
> 
> xref
> 0 47
> 0000000000 65535 f 
> 0000000009 00000 n 
> 0000052584 00000 n 
> 0000052633 00000 n 
> 0000009275 00000 n 
> 0000000199 00000 n 
> 0000003543 00000 n 
> ....
> 0000052345 0000 n 
> 
> trailer
> <<
> 
> /Size 47
> /Root 2 0 R
> /Info 1 0 R
> startxref
> 52279
> %%EOF
> 
> After replacing the startxref value 52279 with 53730, which is the offset of
> the "xref" keyword, the xref table position error seems to be solved.
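
The manual check above (compare the number after startxref with what is really stored at that byte offset) can be sketched in Java. This is only an illustration, not part of PDFBox: the class name StartxrefCheck and both method names are hypothetical, and the file is decoded as ISO-8859-1 on the assumption that one character then equals one byte, so string indexes can be used as byte offsets.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a PDFBox class.
final class StartxrefCheck {
    // "startxref", whitespace or line break, then the decimal byte offset.
    private static final Pattern STARTXREF = Pattern.compile("startxref\\s+(\\d+)");

    /** Returns the offset declared by the last startxref in the file, or -1 if there is none. */
    static long declaredOffset(byte[] pdf) {
        String s = new String(pdf, StandardCharsets.ISO_8859_1);
        Matcher m = STARTXREF.matcher(s);
        long last = -1;
        while (m.find()) {
            last = Long.parseLong(m.group(1));
        }
        return last;
    }

    /**
     * True if the bytes at the given offset start with the "xref" keyword
     * (classic table) or with an "N G obj" header (possibly an xref stream;
     * this sketch does not check the object's /Type).
     */
    static boolean pointsAtXref(byte[] pdf, long offset) {
        if (offset < 0 || offset >= pdf.length) {
            return false;
        }
        String s = new String(pdf, StandardCharsets.ISO_8859_1);
        int start = (int) offset;
        String head = s.substring(start, Math.min(s.length(), start + 32));
        return head.startsWith("xref") || head.matches("(?s)\\d+\\s+\\d+\\s+obj.*");
    }
}
```

If pointsAtXref returns false for the value returned by declaredOffset, the file has the same kind of defect that triggers the warning in this thread.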
> 
> But the following warning is still being displayed and some text is still not being extracted:
> 
> Loading PDF D:\Documents and Settings\05215385726\rpf_tributos.pdf
> Time for loading: 0.094 seconds
> Starting text extraction
> Writing to D:\Documents and Settings\05215385726\rpf_tributos.txt
> Nov 14, 2013 11:24:36 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: o
> Nov 14, 2013 11:24:36 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: Os
> Nov 14, 2013 11:24:36 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: a
> Nov 14, 2013 11:24:36 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: su
> 
> 
> Also, with the "-nonSeq" option enabled, the error below is displayed:
> 
> Loading PDF D:\Documents and Settings\05215385726\Meus documentos\rpf_tributos.pdf
> Exception in thread "main" java.io.IOException: Error: Expected a long type, actual='K`_'
> at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1668)
> at org.apache.pdfbox.pdfparser.BaseParser.readObjectNumber(BaseParser.java:1598)
> at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:1183)
> at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:1130)
> at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:420)
> at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:702)
> at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1139)
> at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:208)
> at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
> 
> 
> I wonder if I could write a routine to fix a document like this before
> parsing it with PDFBox, since it can be parsed by Acrobat Reader.
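
A minimal sketch of such a pre-processing routine, assuming the only defect is a wrong startxref value in a file that uses a classic "xref" table (as in the excerpt above). It is not a PDFBox feature: the class StartxrefRepair and its method are hypothetical names. It finds the last "xref" keyword that stands alone on a line before the final startxref and rewrites the number after startxref to that position. Because the number sits after the table, changing its length does not move the table.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a PDFBox class.
final class StartxrefRepair {
    // "xref" alone on a line: preceded by CR or LF, followed by optional
    // blanks and an end-of-line. This does not match inside "startxref".
    private static final Pattern XREF_KEYWORD =
            Pattern.compile("(?<=[\\r\\n])xref(?=[ \\t]*[\\r\\n])");
    private static final Pattern STARTXREF = Pattern.compile("startxref\\s+(\\d+)");

    /** Returns a copy with the last startxref value pointing at the last xref keyword. */
    static byte[] repair(byte[] pdf) {
        // ISO-8859-1 maps every byte to one char and back without loss.
        String s = new String(pdf, StandardCharsets.ISO_8859_1);
        int sx = s.lastIndexOf("startxref");
        if (sx < 0) {
            return pdf; // nothing to patch
        }
        int xref = -1;
        Matcher k = XREF_KEYWORD.matcher(s);
        while (k.find() && k.start() < sx) {
            xref = k.start();
        }
        Matcher d = STARTXREF.matcher(s);
        if (xref < 0 || !d.find(sx)) {
            return pdf; // no classic table found, or no number after startxref
        }
        String patched = s.substring(0, d.start(1)) + xref + s.substring(d.end(1));
        return patched.getBytes(StandardCharsets.ISO_8859_1);
    }
}
```

The result could be written to a temporary file (for example with java.nio.file.Files.write) and that file passed to the extractor. The sketch does not validate the offsets stored inside the table, so a file with further damage, as seems to be the case here, may still fail to parse.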
> 
> Thanks,
> 
> Rodrigo
> 
> 
> 
> On Wednesday, 13 November 2013 19:49, Thomas Chojecki <info@rayman2200.de> wrote:
> 
> Hi Rodrigo,
> it looks like the startxref position (52779) is wrong and points into a
> stream instead of at the beginning of an xref table or stream. The value
> inside the exception looks like compressed data, and it might be the
> xref stream.
> 
> You can open a hex editor, jump directly to position 52779 and
> look for an object that looks like
> 
> ,---
> 
> 80 0 obj <<
> /Type /XRef
> /Index [0 424]
> /Size 424
> /W [1 3 1]
> /Root 421 0 R
> /Info 422 0 R
> /ID [<14895AE8C3218939710EBBFF5EAD0E28> <14895AE8C3218939710EBBFF5EAD0E28>]
> /Length 1073
> /Filter /FlateDecode
> stream
> ...
> endstream
> endobj
> 
> `---
> 
> If you find this object with the /Type /XRef, go to the beginning of
> it (in this case the "80 0 obj") and write down the position of this
> object. Then go to the end of the file, overwrite the startxref value
> 52779 with the position you noted and try to parse the document
> again.
> 
> This should work, and it indicates that the PDF creator you are using
> writes wrong object positions. PDFBox can only parse documents that
> provide correct xref tables / streams; otherwise the parser does not
> know how to handle the document.
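
The hex-editor procedure described above can also be sketched in code. The following Java fragment is an assumption-laden illustration, not PDFBox code: XrefStreamLocator and lastXrefStreamOffset are hypothetical names. It looks for the last "/Type /XRef" entry and returns the byte offset of the nearest "N G obj" header before it, which is the position one would write after startxref.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a PDFBox class.
final class XrefStreamLocator {
    // Object header such as "80 0 obj".
    private static final Pattern OBJ_HEADER = Pattern.compile("\\d+\\s+\\d+\\s+obj\\b");
    // Dictionary entry that marks a cross-reference stream.
    private static final Pattern XREF_TYPE = Pattern.compile("/Type\\s*/XRef\\b");

    /**
     * Returns the byte offset of the header of the last object that contains
     * "/Type /XRef", or -1 if no such object is found.
     */
    static int lastXrefStreamOffset(byte[] pdf) {
        // ISO-8859-1 keeps one char per byte, so indexes are byte offsets.
        String s = new String(pdf, StandardCharsets.ISO_8859_1);
        int typePos = -1;
        Matcher t = XREF_TYPE.matcher(s);
        while (t.find()) {
            typePos = t.start();
        }
        if (typePos < 0) {
            return -1;
        }
        int header = -1;
        Matcher h = OBJ_HEADER.matcher(s);
        while (h.find() && h.start() < typePos) {
            header = h.start(); // keep the closest header before /Type /XRef
        }
        return header;
    }
}
```

A return value of -1 corresponds to the situation where no "/Type /XRef" object exists and the file relies on a classic "xref" table instead. Since the search is purely textual, a match inside binary stream data is theoretically possible, so the result should be checked by eye before patching a file.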
> 
> Best regards
> Thomas
> 
> 
> 
> Quoting Rodrigo Caniçali <rodrigo.canicali@yahoo.com.br>:
> 
>> Hi Thomas,
>> 
>> Below is the stacktrace when the option “-nonSeq” is enabled:
>> 
>> Loading PDF D:\Documents and Settings\05215385726\Meus  
>> documentos\rpf_tributos.pdf
>> Exception in thread "main" java.io.IOException: Error: Expected a  
>> long type, actual='!@:g8lJLDX5I'H%oMioAqC?O$d[,X]%dZ#a?Wos'
>> at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1668)
>> at  
>> org.apache.pdfbox.pdfparser.BaseParser.readObjectNumber(BaseParser.java:1598)
>> at
>> org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseXrefObjStream(NonSequentialPDFParser.java:460)
>> at  
>> org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:358)
>> at  
>> org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:702)
>> at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1139)
>> at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:208)
>> at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
>> 
>> 
>> When that option is disabled, the following warnings are printed on
>> the Eclipse console and some text of the PDF document is not extracted:
>> 
>> Loading PDF D:\Documents and Settings\05215385726\Meus
>> documentos\rpf_tributos.pdf
>> Nov 04, 2013 10:16:13 AM  
>> org.apache.pdfbox.pdfparser.XrefTrailerResolver setStartxref
>> WARNING: Did not found XRef object at specified startxref position 52779
>> Time for loading: 0.125 seconds
>> Starting text extraction
>> Writing to D:\Documents and Settings\05215385726\Meus  
>> documentos\rpf_tributos.txt
>> Nov 04, 2013 10:16:14 AM org.apache.pdfbox.util.PDFStreamEngine  
>> processOperator
>> INFO: unsupported/disabled operation: o
>> Nov 04, 2013 10:16:14 AM org.apache.pdfbox.util.PDFStreamEngine  
>> processOperator
>> INFO: unsupported/disabled operation: Os
>> Nov 04, 2013 10:16:14 AM org.apache.pdfbox.util.PDFStreamEngine
>> processOperator
>> INFO: unsupported/disabled operation: a
>> Nov 04, 2013 10:16:14 AM org.apache.pdfbox.util.PDFStreamEngine  
>> processOperator
>> INFO: unsupported/disabled operation: su
>> 
>> Thanks,
>> 
>> Rodrigo
>> 
>> 
>> 
>> On Saturday, 2 November 2013 10:24, Rodrigo Caniçali
>> <rodrigo.canicali@yahoo.com.br> wrote:
>> 
>> Hi Thomas,
>> 
>> Thanks for your answer.
>> 
>> I am afraid the document is confidential, but I can provide the
>> stacktrace and find out whether it is possible to generate a
>> non-confidential example on Monday, when I will be at the office again.
>> 
>> Best regards,
>> Rodrigo
>> 
>> 
>> 
>> 
>> 
>> On Saturday, 2 November 2013 5:50, Thomas Chojecki
>> <info@rayman2200.de> wrote:
>> 
>> 
>> Quoting Rodrigo Caniçali <rodrigo.canicali@yahoo.com.br>:
>> 
>>> Hi,
>> Hi Rodrigo,
>> 
>>> I found a mailing list thread from 2012-06-14 where this problem was
>>> already discussed, but my case is rather different.
>> I think I found the discussion.
>> 
>>> I also get the warning "Did not found XRef object at specified
>>> startxref position xxx" when executing the main function
>>> of the org.apache.pdfbox.ExtractText class. However, some of the PDF
>>> text is ignored and not written to the output TXT file. The same text
>>> is displayed by Acrobat Reader and can be copied as text by the user
>>> from that program.
>> 
>> Your document is broken, and it works with Acrobat Reader because
>> Reader is not strict about the specification.
>> 
>> Many developers who write a PDF writer test it against Acrobat
>> Reader and do not always follow the specification. So the goal
>> becomes creating documents that Acrobat Reader accepts rather than
>> documents that conform to the specification. This leads to the problem
>> that third-party libraries like PDFBox sometimes can't parse such
>> documents.
>> 
>> In your case the xref table isn't where the parser expects it.
>> If you can provide us with such a document, we can try to find the
>> cause of the problem and maybe fix it.
>> 
>>> 
>>> If the option "-nonSeq" is selected, a "java.io.IOException: Error:
>>> Expected a long type, actual=..." appears, which stops the text
>>> extraction.
>> Maybe you can post the first three lines of the stacktrace; this
>> will help in debugging the problem.
>> 
>>> Please, is there any way to make it work?
>> It is nearly impossible to reconstruct such cases. If you can provide
>> us with more information or maybe the document, it will help us
>> improve the parser, if possible.
>> 
>> We do our best to support as many documents as we can, but in some
>> cases we need to be strict so that documents which currently parse
>> fine keep working. This problem is also one item on the agenda for
>> PDFBox version 2.0.0.
>> 
>> 
>>> 
>>> Thanks,
>>> 
>>> Rodrigo
>> 
>> Best regards
>> Thomas
