pdfbox-users mailing list archives

From Jonas Karlsson <thejo...@gmail.com>
Subject TextExtraction only working after uncompressing with pdftk
Date Mon, 28 Apr 2014 15:00:21 GMT
Hello,

We have a user with PDFs generated by a commercial transcription service.
When we try to extract text from these PDFs, PDFBox returns only a few empty
lines. We get this result both from our own code and from the ExtractText
command-line tool.

If I specify the non-sequential parser, with the -nonSeq flag, the
following error is produced:

Apr 28, 2014 10:35:11 AM org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength
SEVERE: The end of the stream doesn't point to the correct offset, using workaround to read the stream
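
For readers who hit the same SEVERE message, here is a minimal sketch of the kind of check it seems to describe: whether the /Length entry of each stream actually leads to the "endstream" keyword. This is not PDFBox's implementation. The class name StreamLengthCheck and the method findMismatches are hypothetical, the sketch uses only the Java standard library, and it assumes direct (not indirect) /Length values and an uncorrupted "stream" keyword.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: reports streams whose /Length
// value does not land on the "endstream" keyword.
final class StreamLengthCheck {

    // "stream" followed by an end-of-line, but not the tail of "endstream".
    private static final Pattern STREAM = Pattern.compile("(?<!end)stream(\\r\\n|\\n)");
    // A /Length entry; group 2 is set when the value is an indirect reference ("12 0 R").
    private static final Pattern LENGTH = Pattern.compile("/Length\\s+(\\d+)(\\s+\\d+\\s+R)?");

    /** Returns the byte offsets of "stream" keywords whose /Length looks wrong. */
    static List<Integer> findMismatches(byte[] pdf) {
        // ISO-8859-1 maps one byte to one char, so string indexes equal byte offsets.
        String s = new String(pdf, StandardCharsets.ISO_8859_1);
        List<Integer> bad = new ArrayList<>();
        Matcher m = STREAM.matcher(s);
        int from = 0;
        while (m.find(from)) {
            int keyword = m.start();
            int dataStart = m.end();
            from = dataStart;

            // Look for the last /Length between the enclosing "obj" and the keyword.
            int objStart = Math.max(0, s.lastIndexOf("obj", keyword));
            Matcher lm = LENGTH.matcher(s).region(objStart, keyword);
            String digits = null;
            boolean indirect = false;
            while (lm.find()) {
                digits = lm.group(1);
                indirect = lm.group(2) != null;
            }
            if (digits == null || indirect) {
                continue; // nothing to check without resolving other objects
            }

            long end;
            try {
                end = dataStart + Long.parseLong(digits);
            } catch (NumberFormatException e) {
                bad.add(keyword); // absurdly large length value
                continue;
            }
            if (end > s.length()) {
                bad.add(keyword);
                continue;
            }

            // An optional end-of-line may sit between the data and "endstream".
            int pos = (int) end;
            if (pos < s.length() && s.charAt(pos) == '\r') pos++;
            if (pos < s.length() && s.charAt(pos) == '\n') pos++;
            if (s.startsWith("endstream", pos)) {
                from = pos + "endstream".length(); // skip the stream body
            } else {
                bad.add(keyword);
            }
        }
        return bad;
    }

    public static void main(String[] args) throws Exception {
        for (int offset : findMismatches(Files.readAllBytes(Paths.get(args[0])))) {
            System.out.println("Suspicious /Length for stream keyword at byte offset " + offset);
        }
    }
}
```

Running it with the path to a PDF as the only argument prints one line per suspicious stream. Because it scans raw bytes, binary stream data that happens to contain "stream" followed by a newline can cause false positives, so the output is a hint about where to look rather than a verdict.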


If I uncompress the file with pdftk, PDFBox is able to extract the text
successfully.

Is it possible to perform the same uncompression with PDFBox? When I try
the WriteDecodedDoc command, I get an error:

java.io.StreamCorruptedException: Error: data is null
 at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:82)


The PDF appears to have been generated by Aspose.Words for .NET 10.0.0.0.
Unfortunately, I'm not authorized to share the file.


I realize my description of the problem does not give you much to go on, but
I would appreciate any suggestions.


Thanks!


_jonas
