pdfbox-dev mailing list archives

From "Tilman Hausherr (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (PDFBOX-2048) TextExtraction only working after uncompressing with pdftk
Date Fri, 02 May 2014 08:41:18 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984050#comment-13984050 ]

Tilman Hausherr edited comment on PDFBOX-2048 at 5/2/14 8:41 AM:
-----------------------------------------------------------------

Change committed in the trunk in rev 1590873, and rev 1590874 in the 1.8 branch.

Jonas, you can find a new jar file at 
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox/1.8.6-SNAPSHOT/
within a few hours. However, it will be a few months before this is officially released.


was (Author: tilman):
Change committed in the trunk in rev 1590873, and rev 1590874 in the 1.8 branch.

Jonas, you can find a new jar file at 
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox/1.8.6-SNAPSHOT/
within a few hours. However, it will be a few months before this is officially released.

I will set this issue to resolved after the release of 1.8.5 (which will not include this
change, because the release cut was already done).

> TextExtraction only working after uncompressing with pdftk
> ----------------------------------------------------------
>
>                 Key: PDFBOX-2048
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2048
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing, Rendering, Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>
> From Jonas Karlsson on the user list:
> ===
> We have a user with PDFs generated by a commercial transcription service.
> When we try to extract text from these pdfs, pdfbox returns a few empty
> lines. We get this result both from our own code, and when using the
> ExtractText command line tool
> If I specify the non-sequential parser, with the -nonSeq flag, the
> following error is produced:
> Apr 28, 2014 10:35:11 AM org.apache.pdfbox.pdfparser.NonSequentialPDFParser
> validateStreamLength
> SEVERE: The end of the stream doesn't point to the correct offset, using
> workaround to read the stream
> If I uncompress the file with pdftk, pdfbox is able to successfully extract
> the text.
> ===
> I have been given permission to attach the file "committers only". So don't pass it around,
> avoid quoting details from the file. The file is also not rendering. The lengths of the streams
> are 0.
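
The log line quoted above mentions a workaround for reading a stream whose
declared length does not lead to the expected offset, and the report notes
that the stream lengths in this file are 0. As a rough illustration of that
kind of fallback, here is a minimal sketch. It is not PDFBox's actual code:
the class StreamLengthSketch and its methods are hypothetical names, and the
approach (trust the declared length only if the "endstream" keyword follows
it, otherwise scan forward for the keyword) is an assumption about how such a
workaround can be written.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only. Hypothetical class, not part of PDFBox.
final class StreamLengthSketch {

    private static final byte[] ENDSTREAM =
            "endstream".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the bytes starting at pos spell "endstream". */
    static boolean keywordAt(byte[] data, int pos) {
        if (pos < 0 || pos + ENDSTREAM.length > data.length) {
            return false;
        }
        for (int i = 0; i < ENDSTREAM.length; i++) {
            if (data[pos + i] != ENDSTREAM[i]) {
                return false;
            }
        }
        return true;
    }

    /** Skips one optional end-of-line marker (CR, LF or CR LF). */
    static int skipEol(byte[] data, int pos) {
        if (pos < data.length && data[pos] == '\r') {
            pos++;
        }
        if (pos < data.length && data[pos] == '\n') {
            pos++;
        }
        return pos;
    }

    /**
     * Returns the usable length of the stream data beginning at start.
     * The declared length is accepted only if "endstream" follows it;
     * otherwise the data is scanned for the keyword. Returns -1 if the
     * keyword is never found.
     */
    static int resolveLength(byte[] data, int start, int declaredLength) {
        boolean inRange = start >= 0 && start <= data.length
                && declaredLength >= 0
                && declaredLength <= data.length - start;
        if (inRange && keywordAt(data, skipEol(data, start + declaredLength))) {
            return declaredLength;
        }
        if (start < 0) {
            return -1;
        }
        // Fallback scan. Caveat: binary stream data could contain the
        // keyword by coincidence, so this is a heuristic, not a guarantee.
        for (int pos = start; pos + ENDSTREAM.length <= data.length; pos++) {
            if (keywordAt(data, pos)) {
                int end = pos;
                // Drop the single end-of-line marker before the keyword.
                if (end > start && data[end - 1] == '\n') {
                    end--;
                }
                if (end > start && data[end - 1] == '\r') {
                    end--;
                }
                return end - start;
            }
        }
        return -1;
    }
}
```

With a declared length of 0 and five bytes of data before "endstream", the
declared value is rejected and the scan determines the length instead. A
correct declared length is accepted without scanning.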



--
This message was sent by Atlassian JIRA
(v6.2#6252)
