pdfbox-users mailing list archives

From Rodrigo Caniçali <rodrigo.canic...@yahoo.com.br>
Subject Re: WARNING: Did not found XRef object at specified startxref position
Date Mon, 04 Nov 2013 12:24:41 GMT
Hi Thomas,

Below is the stack trace when the option "-nonSeq" is enabled:

Loading PDF D:\Documents and Settings\05215385726\Meus documentos\rpf_tributos.pdf
Exception in thread "main" java.io.IOException: Error: Expected a long type, actual='!@:g8lJLDX5I'H%oMioAqC?O$d[,X]%dZ#a?Wos'
at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1668)
at org.apache.pdfbox.pdfparser.BaseParser.readObjectNumber(BaseParser.java:1598)
at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseXrefObjStream(NonSequentialPDFParser.java:460)
at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:358)
at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:702)
at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1139)
at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:208)
at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)


When that option is disabled, the following warnings are printed to the Eclipse console and
some of the PDF document's text is not extracted:

Loading PDF D:\Documents and Settings\05215385726\Meus documentos\rpf_tributos.pdf
Nov 04, 2013 10:16:13 AM org.apache.pdfbox.pdfparser.XrefTrailerResolver setStartxref
WARNING: Did not found XRef object at specified startxref position 52779
Time for loading: 0.125 seconds
Starting text extraction
Writing to D:\Documents and Settings\05215385726\Meus documentos\rpf_tributos.txt
Nov 04, 2013 10:16:14 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: o
Nov 04, 2013 10:16:14 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: Os
Nov 04, 2013 10:16:14 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: a
Nov 04, 2013 10:16:14 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: su

Thanks,

Rodrigo



On Saturday, 2 November 2013 10:24, Rodrigo Caniçali <rodrigo.canicali@yahoo.com.br>
wrote:
 
Hi Thomas,

Thanks for your answer.

I am afraid the document is confidential, but I can provide the stack trace. On Monday, when
I am back at the office, I will find out whether it is possible to generate a non-confidential
example.

Best regards,
Rodrigo





On Saturday, 2 November 2013 5:50, Thomas Chojecki <info@rayman2200.de> wrote:


Quoting Rodrigo Caniçali <rodrigo.canicali@yahoo.com.br>:

> Hi,
Hi Rodrigo,

> I found that this problem was already discussed on the mailing list on
> 2012-jun-14, but my case is rather different.
I think I found the discussion.

> I also get the warning "Did not found XRef object at specified
> startxref position xxx" when executing the main function
> of the org.apache.pdfbox.ExtractText class. However, some of the PDF's
> text is ignored and not written to the output TXT file. The same text
> is displayed by Acrobat Reader and can be copied from that program
> as text.

Your document is broken. It works with Acrobat Reader because Acrobat
Reader does not enforce the specification strictly.

Many developers who write a PDF writer test it against Acrobat Reader
and do not always follow the specification. So the goal becomes
creating documents that Acrobat Reader accepts rather than documents
that conform to the specification. This leads to the problem that
third-party libraries like pdfbox sometimes can't parse such documents.

In your case the xref table isn't where the parser expects it. If you
can provide us with such a document, we can try to find the cause of
the problem and maybe fix it.

>
> If the option "-nonSeq" is selected, a
> "java.io.IOException: Error: Expected a long type, actual=..." appears, which
> stops the text extraction.
Maybe you can post the first three lines of the stack trace; this
will help with debugging the problem.

> Please, is there any way to make it work?
It is nearly impossible to reconstruct such cases. If you can provide
us with more information or maybe the document, it will help us improve
the parser, if possible.

We do our best to support as many documents as we can, but in some
cases we need to be strict so that documents which already parse fine
keep working. This problem is also one of the items on the agenda for
pdfbox 2.0.0.


>
> Thanks,
>
> Rodrigo

Best regards
Thomas