pdfbox-users mailing list archives

From RENTON Scott <Scott.Ren...@ed.ac.uk>
Subject Re: Failures- corrupt PDFs?
Date Thu, 01 Jun 2017 10:25:43 GMT
Hi Maruan, thanks for the swift response. It looks like it’s 1.6.0 (quite old?); that’s
certainly the .jar that’s sitting in the DSpace lib directory. I’ve copied in George, as
he’s investigating this too. George, I take it we’re OK to send Maruan a link to the relevant
records in the repository?
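
The jar's file name is only a hint at the version. A minimal sketch of a firmer check, assuming the jar's manifest carries an Implementation-Version entry (not confirmed for this particular jar); the class name JarVersionCheck and the path passed on the command line are hypothetical:

```java
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

public class JarVersionCheck {

    /**
     * Returns the Implementation-Version from the jar's manifest,
     * or null if the jar has no manifest or no such entry.
     */
    public static String implementationVersion(String jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            Manifest manifest = jar.getManifest();
            if (manifest == null) {
                return null;
            }
            return manifest.getMainAttributes()
                           .getValue(Attributes.Name.IMPLEMENTATION_VERSION);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            // No default path: the real location of the jar is site-specific.
            System.out.println("usage: java JarVersionCheck <path-to-jar>");
            return;
        }
        System.out.println(implementationVersion(args[0]));
    }
}
```

If this prints null, the manifest does not record a version and the file name is the best evidence available.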

Cheers
Scott
-- 
Scott Renton

Digital Development
Library and University Collections
Argyle House, Floor F
ext: 515219

On 01/06/2017 11:18, "Maruan Sahyoun" <sahyoun@fileaffairs.de> wrote:

>Hi Scott,
>
>which version of PDFBox are you using? Is it possible to share one of the PDFs at a public
>location?
>
>BR
>Maruan
>
>> Am 01.06.2017 um 12:11 schrieb RENTON Scott <Scott.Renton@ed.ac.uk>:
>> 
>> 
>> Hi folks (apologies- hit send too soon)
>> 
>> We run pdfbox for pdf text extraction under the Dspace application.
>> 
>> Occasionally we get the odd failure, and we’re investigating some errors just now.
>> I’m just wondering what property of the PDF in question it’s looking at here, and if there’s
>> any way we can mitigate against that. It’s certainly not the title.
>> 
>> 
>> One is:
>> java.lang.RuntimeException: java.io.IOException: Not a number: +
>> java.lang.RuntimeException: java.io.IOException: Not a number: +
>> at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:178)
>> at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:187)
>> at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:266)
>> at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
>> at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
>> at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
>> at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
>> at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
>> at org.dspace.app.mediafilter.PDFFilter.getDestinationStream(PDFFilter.java:101)
>> 
>> 
>> And here’s another:
>> 
>> java.lang.NumberFormatException: For input string: "dup"
>> java.lang.NumberFormatException: For input string: "dup"
>> at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>> at java.lang.Integer.parseInt(Integer.java:492)
>> at java.lang.Integer.parseInt(Integer.java:527)
>> at org.apache.pdfbox.pdmodel.font.PDType1Font.getEncodingFromFont(PDType1Font.java:344)
>> at org.apache.pdfbox.pdmodel.font.PDType1Font.determineEncoding(PDType1Font.java:280)
>> at org.apache.pdfbox.pdmodel.font.PDFont.<init>(PDFont.java:181)
>> at org.apache.pdfbox.pdmodel.font.PDSimpleFont.<init>(PDSimpleFont.java:83)
>> at org.apache.pdfbox.pdmodel.font.PDType1Font.<init>(PDType1Font.java:152)
>> at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:108)
>> at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:
>> 5)
>> at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:115)
>> 
>> Thanks
>> Scott
>> -- 
>> Scott Renton
>> Digital Development
>> Library and University Collections
>> Argyle House, Floor F
>> ext: 515219
>> 
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: users-help@pdfbox.apache.org

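Both quoted traces end in unchecked exceptions (a RuntimeException wrapping an IOException, and a NumberFormatException), so one general mitigation is to isolate each file so that a single bad PDF is recorded and skipped rather than stopping the run. A minimal sketch of that pattern, not taken from DSpace or PDFBox: the Extractor interface and the TolerantExtraction class are hypothetical stand-ins for whatever actually performs the text extraction.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TolerantExtraction {

    /** Hypothetical stand-in for the real per-file extraction call. */
    public interface Extractor {
        String extract(String name) throws Exception;
    }

    /**
     * Extracts text for each name in order. A failure on one file is
     * appended to 'failures' and the loop carries on with the next file.
     */
    public static Map<String, String> extractAll(List<String> names,
                                                 Extractor extractor,
                                                 List<String> failures) {
        Map<String, String> texts = new LinkedHashMap<>();
        for (String name : names) {
            try {
                texts.put(name, extractor.extract(name));
            } catch (Exception e) {
                // Covers checked exceptions as well as unchecked ones such as
                // RuntimeException and NumberFormatException.
                failures.add(name + ": " + e);
            }
        }
        return texts;
    }
}
```

This only contains the damage; it does not recover text from the affected files, which would need a parser that tolerates whatever is malformed in them.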
