pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Looking for a way to iterate over images in a PDF
Date Fri, 07 Apr 2017 21:32:57 GMT
On 07.04.2017 at 22:59, David Patterson wrote:
> Tilman,
>
> The ExtractImages sample code is a 1.8 artifact (I believe). It has a lot
> of errors when compiled with 2.0.5 libraries.

Please try this one:
https://svn.apache.org/viewvc/pdfbox/trunk/tools/src/main/java/org/apache/pdfbox/tools/ExtractImages.java?view=markup

Tilman

>
> 1) two imports are no longer in the 2.0.5 library
> import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm;
> import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;
>
> 2) missing methods or methods with different signatures:
> PDDocument.loadNonSeq(                          ** method not defined
> PDDocument.load(                                ** load now requires a File, not a String
> document.openProtection(
> document.getDocumentCatalog().getAllPages()     ** getAllPages is missing from the PDDocumentCatalog
> resources.getXObjects()                         ** where resources is a PDResources object
> if (xobject instanceof PDXObjectImage)          ** PDXObjectImage is not defined
> else if (xobject instanceof PDXObjectForm)      ** same with PDXObjectForm
>
> Maybe a new ExtractImages2 program needs to be developed for the PDFBox 2
> era.
>
> Dave Patterson
>
>
>
>
> On Thu, Apr 6, 2017 at 5:02 PM, Tilman Hausherr <THausherr@t-online.de>
> wrote:
>
>> On 06.04.2017 at 21:22, David Patterson wrote:
>>
>>> I've got some PDFs to try to read. Many of them have images in them. I'd
>>> like to be able to iterate over the images and determine their encoding
>>> (png vs. jpeg vs. ?) and size.
>>>
>>> I've found a sample that lets me iterate over the PDXObject entities, but
>>> I'm missing a key piece to determine the size and format of the objects.
>>>
>>> a) Is a PDXObject always an image, or could it be something else?
>>>
>> Yes, it could be a form. That's why all examples (e.g. ExtractImages.java)
>> always check the type and then cast to the image xobject type. That one
>> will give the size and the filters.
>>
>> Tilman
>>
>>
>>> Here is the code I've got so far.
>>>
>>> for ( PDPage aPage : pdfDocument.getPages() ) {
>>>     PDResources pdResources = aPage.getResources();
>>>     for ( COSName cosObject : pdResources.getXObjectNames() ) {
>>>         PDXObject xObj = pdResources.getXObject( cosObject );
>>>         System.out.println( "got an image maybe" );
>>>     }
>>> }
>>>
>>> This is where I've gotten stumped. I've looked at lots of lists of
>>> COS-whatever things, but it has not led me to "the answer."
>>>
>>> Thanks for any guidance you can provide.
>>>
>>> Dave Patterson
>>>
>>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>
>>
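
For PDFBox 2.0.x, the type check and cast described in this thread might look roughly like the sketch below. It assumes the 2.0 class names PDImageXObject and PDFormXObject (the replacements for the 1.8 PDXObjectImage and PDXObjectForm), loads the document with PDDocument.load(File), and iterates with getPages() and getXObjectNames(). The class name ImageLister, its method names, and the output format are hypothetical and chosen only for illustration. The filter list hints at the encoding (for example DCTDecode for JPEG data), and getSuffix() gives a suggested file extension. The sketch does not guard against forms that reference themselves.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

// Hypothetical helper class; the name and the output format are illustrative only.
class ImageLister {

    // Adds one description line per image reachable from the given resources.
    static void collect(PDResources resources, List<String> out) throws IOException {
        if (resources == null) {
            return; // a page or form may have no resources dictionary
        }
        for (COSName name : resources.getXObjectNames()) {
            PDXObject xobject = resources.getXObject(name);
            if (xobject instanceof PDImageXObject) {
                // Only the image type exposes size, filters and suffix.
                PDImageXObject image = (PDImageXObject) xobject;
                out.add(name.getName() + ": "
                        + image.getWidth() + "x" + image.getHeight()
                        + ", filters=" + image.getStream().getFilters()
                        + ", suffix=" + image.getSuffix());
            } else if (xobject instanceof PDFormXObject) {
                // A form can carry its own images, so descend into its resources.
                collect(((PDFormXObject) xobject).getResources(), out);
            }
        }
    }

    // Walks every page of the document and returns the collected descriptions.
    static List<String> listImages(PDDocument document) throws IOException {
        List<String> out = new ArrayList<>();
        for (PDPage page : document.getPages()) {
            collect(page.getResources(), out);
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the PDF file to inspect.
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            for (String line : listImages(document)) {
                System.out.println(line);
            }
        }
    }
}
```

Compiled against the PDFBox 2.0.x jars, this can be run with the path to a PDF as its only argument; it prints one line per image found on a page or inside a form.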

