pdfbox-users mailing list archives

From Maruan Sahyoun <sahy...@fileaffairs.de>
Subject Re: errors with PDPage.convertToImage()
Date Mon, 08 Apr 2013 08:18:06 GMT
Hi,

could you also try the PDFToImage command, as Andreas suggested (and as I actually meant)?
It converts a PDF to images page by page. ExtractImages only extracts the images embedded
in a page; it does not deal with text, line art, etc.

I will take a quick look at the sample you provided.

BR

Maruan Sahyoun

On 08.04.2013 at 10:14, Alexander Klenner <alexander.garvin.klenner@scai.fraunhofer.de> wrote:

> Hi Maruan,
> 
> thank you, I now have a first clue about what is happening. As you suggested, I ran the
> ExtractImages command from the command line. It produces many images, and they are the
> same ones that I see on the pages created by convertToImage().
> 
> Using the ExtractText command from the command line, I get all the text from this PDF.
> So somehow convertToImage() seems to return only what ExtractImages returns for this
> particular PDF.
> I also tried PDFToImage with the nonSeq parameter; it returns exactly the same semi-empty
> pages that my Java code produces.
> 
> So I conclude that for some PDFs convertToImage() returns text and images, while for others
> it returns only images. Is this the expected behaviour?
> 
> All the PDFs I process have 'real' text, which is selectable and is not covered by an
> image layer of text of some sort (at least I think so).
> 
> I uploaded the PDF and the output of PDFToImage to https://www.dropbox.com/sh/inkcdahx4da1kzp/13bnj-BrZt
> 
> Cheers,
> 
> Alex
> 
> 
> 
> --
> Dr. Alexander G. Klenner
> Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
> Schloss Birlinghoven, D-53754 Sankt Augustin
> Tel.: +49 - 2241 - 14 - 2736
> E-mail: alexander.garvin.klenner@scai.fraunhofer.de
> Internet: http://www.scai.fraunhofer.de
> 
> 
> ----- Original Message -----
> From: "Maruan Sahyoun" <sahyoun@fileaffairs.de>
> To: users@pdfbox.apache.org
> Sent: Monday, April 8, 2013 9:20:10 AM
> Subject: Re: errors with PDPage.convertToImage()
> 
> Hi,
> 
> unfortunately the attachment didn't make it through.
> 
> Could you try the PDF in question using the command line app ExtractImages with the -nonSeq
> parameter, or use the following code:
> 
> PDDocument pdDoc = PDDocument.loadNonSeq(…)
> 
> The NonSequentialParser gives better results if the document has incremental updates.
> In addition, it is not necessary to create a new PDDocument from the cosDoc, as
> parser.getPDDocument() already returns a PDDocument.
> 
> BR from your neighborhood
> 
> 
> Maruan Sahyoun
> 
> On 08.04.2013 at 08:52, Alexander Klenner <alexander.garvin.klenner@scai.fraunhofer.de> wrote:
> 
>> Hi all,
>> 
>> I frequently come across PDFs for which the convertToImage() method generates blank
>> or partly blank images. One of those PDFs is attached to this mail.
>> 
>> My code for processing: 
>> 
>> // Parse with the sequential PDFParser, then wrap the COSDocument in a PDDocument.
>> PDFParser parser = new PDFParser(new FileInputStream(f));
>> parser.parse();
>> cosDoc = parser.getDocument();
>> 
>> pdDoc = new PDDocument(cosDoc);
>> ..
>> Iterator<PDPage> it = pdDoc.getDocumentCatalog().getAllPages().iterator();
>> PDPage page = it.next();
>> ...
>> PDRectangle cropBox = page.findCropBox();
>> Dimension dimension = cropBox.createDimension();
>> ...
>> BufferedImage img = page.convertToImage(BufferedImage.TYPE_INT_RGB, ImageParser.PARAM_DPI);
>> 
>> 
>> I am using pdfbox-app-1.8.0.jar.
>> 
>> So I have two questions: 
>> 
>> 1. Is there a different way to extract the page as an image, one that I am not aware
>> of, that produces the correct image?
>> 2. Or is it possible to detect, before or after the extraction, that this page was
>> not extracted correctly?
>> 
>> At the moment I have no way of knowing when I am dealing with a corrupted image.
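
On the second question, one rough way to flag a suspicious page after rendering is to
compare the rendered image with the extracted text: if text extraction (ExtractText, or
PDFTextStripper in code) returns text for a page but the image is almost entirely white,
the text was probably not drawn. The following is a minimal sketch under that assumption
and is not part of PDFBox. The class name BlankPageCheck, the method names, the brightness
cutoff of 200 and the minInkRatio threshold are all hypothetical and would need tuning.
It uses only java.awt, and it will not catch pages like the sample in this thread, where
the images are drawn and only the text is missing, unless the threshold is set high
enough for the expected text coverage.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, not a PDFBox class: a heuristic for pages that rendered nearly blank.
final class BlankPageCheck {

    private BlankPageCheck() {
    }

    /** Returns the fraction of pixels that are clearly darker than white paper. */
    static double inkRatio(BufferedImage img) {
        int width = img.getWidth();
        int height = img.getHeight();
        long dark = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                // Plain channel average as a brightness estimate; 200 is an arbitrary cutoff.
                if ((r + g + b) / 3 < 200) {
                    dark++;
                }
            }
        }
        return (double) dark / ((long) width * height);
    }

    /**
     * Flags a page whose extracted text is non-empty although the rendered image
     * contains less "ink" than minInkRatio (an assumed, document-specific threshold).
     */
    static boolean looksSuspicious(BufferedImage img, String extractedText, double minInkRatio) {
        boolean hasText = extractedText != null && extractedText.trim().length() > 0;
        return hasText && inkRatio(img) < minInkRatio;
    }
}
```

With PDFBox 1.8 the image would come from page.convertToImage(...) as in the code above,
and the text from a text extraction run over the same page. A call such as
BlankPageCheck.looksSuspicious(img, text, 0.001) then returns true for a page that has
text but rendered as nearly white.
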
>> 
>> Thanks a lot for any hints,
>> 
>> Alex
>> 
>> --
>> Dr. Alexander G. Klenner
>> Fraunhofer-Institute for Algorithms and Scientific Computing (SCAI)
>> Schloss Birlinghoven, D-53754 Sankt Augustin
>> Tel.: +49 - 2241 - 14 - 2736
>> E-mail: alexander.garvin.klenner@scai.fraunhofer.de
>> Internet: http://www.scai.fraunhofer.de
>> 

