pdfbox-users mailing list archives

From Cool The Breezer <techcool.ku...@yahoo.com>
Subject Re: Exception :org.apache.pdfbox.filter.FlateFilter - Stop reading corrupt stream
Date Sun, 18 Mar 2012 08:05:28 GMT
>- one of your PDFs may be corrupt, try to find out if the exception occurs when processing
the very same document
I can parse the same PDF file on its own without any issue, but in the multi-threaded environment,
after parsing about 200 files, I keep getting this exception and none of the files after that
are parsed successfully. I then have to forcibly stop the parser.
>- you ran into an issue which was resolved in the current trunk [1] 
I have not tried the current trunk; I just downloaded the latest binary release, i.e. v1.6.0.
>- OutOfMemory
I never get an OutOfMemory error, as I have around 8 GB of RAM in my Mac and I set the maximum
memory when parsing.

regards,
RB


________________________________
 From: "Andreas Lehmkühler" <andreas@lehmi.de>
To: Cool The Breezer <techcool.kumar@yahoo.com> 
Sent: Friday, March 16, 2012 1:42 AM
Subject: Re: Exception :org.apache.pdfbox.filter.FlateFilter - Stop reading corrupt stream
 

Hi,

Cool The Breezer <techcool.kumar@yahoo.com> wrote on 15 March 2012 at 07:38:


> Hello Group, 
> I recently downloaded PDFBox 1.6.0. I am using it to parse PDF files
> from URLs in a multi-threaded environment, with a maximum of 4 threads.
> It works fine for about 200 files and then displays the following exception:
> org.apache.pdfbox.filter.FlateFilter - Stop reading corrupt stream 
> I am using PDFBox on Mac OS X Lion. I am using the following code:
> 
> URL url = new URL( filePath ); 
> URLConnection urlConn = url.openConnection(); 
> InputStream inStream = urlConn.getInputStream(); 
> PDFParser pdfParser = new PDFParser(inStream); 
> pdfParser.parse(); 
> document = new PDDocument(pdfParser.getDocument()); 
> PDFTextStripper stripper = new PDFTextStripper(); 
> String str = stripper.getText(document); 
> 
> inStream.close();  
> output.close(); 
> document.close(); 
 
There may be a couple of different reasons for that. The version you are using swallows the
original exception. 
 
- one of your PDFs may be corrupt, try to find out if the exception occurs when processing
the very same document
- you ran into an issue which was resolved in the current trunk [1] 
- OutOfMemory
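
To check the first point under the same 4-thread load, one option is to record which URL
each failure belongs to. The following is only a rough sketch, not PDFBox code: the class
FailureTracker and the Extractor interface are hypothetical names introduced here, and the
parse-and-extract code from the original mail is assumed to be wrapped in an Extractor
implementation. It uses only the JDK and needs Java 8 or later for the lambda.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class FailureTracker {

    /** Hypothetical hook: wrap the existing parse-and-extract code in an implementation. */
    public interface Extractor {
        String extract(String url) throws Exception;
    }

    /**
     * Runs the extractor over every URL on a fixed-size thread pool and
     * returns a map of url -> error description for each document that failed.
     */
    public static Map<String, String> findFailures(List<String> urls,
                                                   Extractor extractor,
                                                   int threads)
            throws InterruptedException {
        Map<String, String> failures = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String url : urls) {
            pool.submit(() -> {
                try {
                    extractor.extract(url);
                } catch (Throwable t) {
                    // Catch Throwable so that errors such as OutOfMemoryError
                    // are recorded as well, not only checked exceptions.
                    failures.put(url, t.toString());
                }
            });
        }
        pool.shutdown();
        // Assumption: one hour is long enough for the whole batch.
        pool.awaitTermination(1, TimeUnit.HOURS);
        return failures;
    }
}
```

If the same URLs show up in the returned map on every run, the documents themselves are
the likely cause; if the failing URLs change from run to run, the problem is more likely
related to the environment (threads, memory) than to a particular PDF.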
 
> 
> In addition to the above error, I am getting the following error:
> ERROR org.apache.pdfbox.pdmodel.font.PDCIDFont - Error: Could not parse predefined CMAP file for 'Adobe--UCS2'
> That does not stop the parser from extracting text, so I am ignoring it. Please suggest a workaround.

> 
> regards, 
> RB 
BR
Andreas Lehmkühler
 
[1] https://issues.apache.org/jira/browse/PDFBOX-1232 