pdfbox-users mailing list archives

From Alexander Klenner <alexander.garvin.klen...@scai.fraunhofer.de>
Subject Re: errors with PDPage.convertToImage()
Date Mon, 08 Apr 2013 08:14:17 GMT
Hi Maruan,

thank you, I now have a first clue about what is happening. As you suggested, I ran the
ExtractImages command from the command line. It produces many images, and they are exactly
the ones I see on the pages I created with convertToImage().

Using ExtractText from the command line, I get all the text from this PDF.
So for this particular PDF, convertToImage() somehow seems to return only what
ExtractImages finds.
I also tried PDFToImage with the nonSeq parameter; it returns exactly the same semi-empty
pages that my Java code produces.

So I conclude that for some PDFs convertToImage() returns text and images, while for others
it returns only images. Is this the expected behaviour?

All the PDFs I process have 'real' text, which is selectable and is not covered by an image
layer of text of some sort (at least I think so).
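
One rough way to flag such pages after rendering is sketched below. It is only a heuristic
and not something PDFBox provides: it assumes a white page background, measures how much of
the rendered BufferedImage is non-white, and reports a page as suspicious when text extraction
returned text but the rendering is almost empty. The class name BlankPageCheck, both method
names, and the threshold values are hypothetical and chosen for illustration only.

```java
import java.awt.image.BufferedImage;

/** Rough heuristic for spotting pages that rendered (almost) empty. Hypothetical helper. */
final class BlankPageCheck {

    private BlankPageCheck() {
    }

    /**
     * Returns the fraction (0.0 to 1.0) of pixels with at least one colour
     * channel below whiteThreshold, i.e. pixels that are not near-white.
     */
    static double inkCoverage(BufferedImage img, int whiteThreshold) {
        int width = img.getWidth();
        int height = img.getHeight();
        long inked = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                if (r < whiteThreshold || g < whiteThreshold || b < whiteThreshold) {
                    inked++;
                }
            }
        }
        return (double) inked / ((long) width * height);
    }

    /**
     * A page is suspicious when text was extracted from it, but the rendered
     * image contains less than minCoverage non-white pixels.
     * The white threshold of 250 is an assumed value.
     */
    static boolean looksSuspicious(BufferedImage img, String extractedText, double minCoverage) {
        boolean hasText = extractedText != null && !extractedText.trim().isEmpty();
        return hasText && inkCoverage(img, 250) < minCoverage;
    }
}
```

The image would come from convertToImage() and the text from a text extraction run on the
same page. Note the limitation: a page on which the images rendered but the text did not can
still have high coverage, so this check only catches pages that come out nearly blank.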

I uploaded the PDF and the output of PDFToImage to https://www.dropbox.com/sh/inkcdahx4da1kzp/13bnj-BrZt

Cheers,

Alex



--
Dr. Alexander G. Klenner
Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
Schloss Birlinghoven, D-53754 Sankt Augustin
Tel.: +49 - 2241 - 14 - 2736
E-mail: alexander.garvin.klenner@scai.fraunhofer.de
Internet: http://www.scai.fraunhofer.de


----- Original Message -----
From: "Maruan Sahyoun" <sahyoun@fileaffairs.de>
To: users@pdfbox.apache.org
Sent: Monday, April 8, 2013 9:20:10 AM
Subject: Re: errors with PDPage.convertToImage()

Hi,

unfortunately the attachment didn't make it through.

Could you try the PDF in question with the command line app ExtractImages and the -nonSeq
parameter, or use the following code:

PDDocument pdDoc = PDDocument.loadNonSeq(…)

The NonSequentialParser gives better results if the document has incremental updates. In addition,
it's not necessary to create a new PDDocument from the cosDoc, as parser.getPDDocument() already
returns a PDDocument.

BR from your neighborhood


Maruan Sahyoun

Am 08.04.2013 um 08:52 schrieb Alexander Klenner <alexander.garvin.klenner@scai.fraunhofer.de>:

> Hi all,
> 
> I frequently come across PDFs where the convertToImage() method is generating blank or
> partly blank images. One of those PDFs is attached to this mail.
> 
> My code for processing: 
> 
> // Parse the file with the default (sequential) parser
> PDFParser parser = new PDFParser(new FileInputStream(f));
> parser.parse();
> cosDoc = parser.getDocument();
> 
> pdDoc = new PDDocument(cosDoc);
> ...
> Iterator<PDPage> it = pdDoc.getDocumentCatalog().getAllPages().iterator();
> PDPage page = it.next();
> ...
> PDRectangle cropBox = page.findCropBox();
> Dimension dimension = cropBox.createDimension();
> ...
> // Render the page at the resolution (DPI) given by ImageParser.PARAM_DPI
> BufferedImage img = page.convertToImage(BufferedImage.TYPE_INT_RGB, ImageParser.PARAM_DPI);
> 
> 
> I am using pdfbox-app-1.8.0.jar.
> 
> So I have two questions: 
> 
> 1. Is there a different way to extract the page as an image that I am not aware of
> and that produces the correct image?
> 2. Or is it possible to detect, before or after the extraction, that a page was not
> extracted correctly?
> 
> At the moment I have no way of knowing when I am dealing with a corrupted image.
> 
> Thanks a lot for any hints,
> 
> Alex
> 
