pdfbox-users mailing list archives

From Andreas Lehmkuehler <andr...@lehmi.de>
Subject Re: Exception :org.apache.pdfbox.filter.FlateFilter - Stop reading corrupt stream
Date Mon, 19 Mar 2012 07:12:23 GMT
Hi,

On 18.03.2012 09:05, Cool The Breezer wrote:
>> - one of your PDFs may be corrupt, try to find out if the exception occurs when processing the very same document
> I can parse the same PDF file without any issue, but in a multi-threaded environment, after parsing 200-odd files, I keep getting this exception and none of the files is parsed successfully. Then I had to forcefully stop the parser.
Hmm, PDFBox isn't designed to be thread-safe, so that could be the problem.
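
To illustrate the thread-safety point, here is a minimal sketch of keeping every parsing object confined to a single task, so that nothing is shared between threads. It uses only the JDK. `Extractor`, `ExtractorFactory` and `extractAll` are hypothetical names made up for this example, not PDFBox API; in a real setup the extractor body would wrap whatever PDFBox calls you use.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerTaskExtraction {

    /** Hypothetical abstraction over "turn a stream into text"; not a PDFBox type. */
    public interface Extractor {
        String extract(InputStream in) throws IOException;
    }

    /** Hypothetical factory, called once per task so no extractor is ever shared. */
    public interface ExtractorFactory {
        Extractor create();
    }

    /** Extracts text from each input on a fixed pool; results keep input order. */
    public static List<String> extractAll(List<byte[]> inputs,
                                          ExtractorFactory factory,
                                          int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (byte[] data : inputs) {
                futures.add(pool.submit(() -> {
                    // Everything below is created and used inside this one task.
                    Extractor extractor = factory.create();
                    try (InputStream in = new ByteArrayInputStream(data)) {
                        return extractor.extract(in);
                    }
                }));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                try {
                    results.add(future.get());
                } catch (ExecutionException e) {
                    // Rethrow with the task's own exception as the cause.
                    throw new IllegalStateException("extraction failed", e.getCause());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting", e);
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

The point of the factory is that each task gets fresh objects and closes its own stream; whether that is enough for your case depends on what the real extractor shares internally.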

>> - you ran into an issue which was resolved in the current trunk [1]
> I have not tried the current trunk; I just downloaded the latest binary release, i.e. v1.6.0.
>> - OutOfMemory
> I never get OutOfMemory as I have around 8GB of RAM in my Mac and I set the max RAM while parsing.
You probably won't see the exception as it is swallowed.
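
This does not recover what 1.6.0 swallows internally, but in a multi-threaded setup it is worth making sure that nothing thrown by your own worker tasks gets lost as well. A minimal JDK-only sketch, assuming the work is submitted as `Callable`s to an `ExecutorService`; `FailureReport` and `runAndCollectFailures` are hypothetical names for this example:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FailureReport {

    /** Runs every named task and returns name -> cause for each task that threw. */
    public static Map<String, Throwable> runAndCollectFailures(
            Map<String, Callable<String>> tasks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Map<String, Future<String>> futures = new LinkedHashMap<>();
        Map<String, Throwable> failures = new LinkedHashMap<>();
        try {
            for (Map.Entry<String, Callable<String>> task : tasks.entrySet()) {
                futures.put(task.getKey(), pool.submit(task.getValue()));
            }
            for (Map.Entry<String, Future<String>> entry : futures.entrySet()) {
                try {
                    // get() is where a failure inside the task becomes visible.
                    entry.getValue().get();
                } catch (ExecutionException e) {
                    // The cause is whatever the task threw, Errors included.
                    failures.put(entry.getKey(), e.getCause());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    failures.put(entry.getKey(), e);
                }
            }
        } finally {
            pool.shutdown();
        }
        return failures;
    }
}
```

If a task is submitted and its `Future` is never queried, anything it throws (an `OutOfMemoryError` too) stays inside the `Future` and never shows up in your output.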

I reread your code and you might change it to something like this:

...
// PDDocument.load() parses the stream itself
PDDocument document = PDDocument.load(inStream);
PDFTextStripper stripper = new PDFTextStripper();
String str = stripper.getText(document);
...

You don't need your own PDFParser.

> regards,
> RB
>
>
> ________________________________
> From: "Andreas Lehmkühler" <andreas@lehmi.de>
> To: Cool The Breezer<techcool.kumar@yahoo.com>
> Sent: Friday, March 16, 2012 1:42 AM
> Subject: Re: Exception :org.apache.pdfbox.filter.FlateFilter - Stop reading corrupt stream
>
>
> Hi,
>
> Cool The Breezer<techcool.kumar@yahoo.com> wrote on 15 March 2012 at 07:38:
>
>> Hello Group,
>> I recently downloaded PDFBox 1.6.0. I am using it to parse PDF files from URLs in a multi-threaded environment, max 4 threads. It works fine for ~200-odd files and then displays the following exception:
>> org.apache.pdfbox.filter.FlateFilter - Stop reading corrupt stream
>> I am using PDFBox on Mac OS X Lion, with the following code:
>>
>> URL url = new URL( filePath );
>> URLConnection urlConn = url.openConnection();
>> InputStream inStream = urlConn.getInputStream();
>> PDFParser pdfParser = new PDFParser(inStream);
>> pdfParser.parse();
>> document = new PDDocument(pdfParser.getDocument());
>> PDFTextStripper stripper = new PDFTextStripper();
>> String str = stripper.getText(document);
>>
>> inStream.close();
>> output.close();
>> document.close();
>
> There may be a couple of different reasons for that. The version you are using swallows the original exception.
>
> - one of your PDFs may be corrupt, try to find out if the exception occurs when processing the very same document
> - you ran into an issue which was resolved in the current trunk [1]
> - OutOfMemory
>
>>
>> In addition to the above error, I am getting "ERROR org.apache.pdfbox.pdmodel.font.PDCIDFont - Error: Could not parse predefined CMAP file for 'Adobe--UCS2'", but that does not stop the parser from extracting text, so I am ignoring this error. Please suggest a workaround.
>>
>> regards,
>> RB
> BR
> Andreas Lehmkühler
>
> [1] https://issues.apache.org/jira/browse/PDFBOX-1232

BR
Andreas Lehmkühler
