pdfbox-users mailing list archives

From jl...@gi-bon.sk
Subject Re: remove all images from pdf
Date Thu, 08 Nov 2012 17:34:41 GMT
good to hear.

I thought you were trying to clone the pdf without its images; that's why I
suggested creating a new content stream.
If you just want to convert the pdf to an image (with the help of PageDrawer),
then your technique is correct.

Btw, you skipped the "LineTo" operator, which means you skipped all "vector
pictures" and every other line in the pdf. So you are probably also removing
lines you don't want to remove (lines are used to paint table and cell
borders).
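
To illustrate the "parse, filter, write a new content stream" idea, here is a minimal sketch in plain Java. It does not use pdfbox at all: the content stream is assumed to be already split into simple string tokens, and the class and method names are made up for the example. In pdfbox the tokens would come from the library's content stream parser and be written back with its content stream writer, and inline image data is binary, so treat this only as an outline of the filtering step. It drops the image operators and leaves path operators such as "l" (LineTo) untouched, so table borders survive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of pdfbox.
public class ImageOperatorFilterSketch {

    /**
     * Returns a copy of the token list without image-painting instructions.
     * Assumption: tokens are plain strings, operands precede their operator.
     */
    public static List<String> stripImages(List<String> tokens) {
        List<String> out = new ArrayList<>();
        boolean inInlineImage = false;
        for (String token : tokens) {
            if (inInlineImage) {
                // Skip everything up to and including EI (end of inline image).
                if (token.equals("EI")) {
                    inInlineImage = false;
                }
                continue;
            }
            if (token.equals("BI")) {
                // BI starts an inline image (type #2 in the list quoted below).
                inInlineImage = true;
                continue;
            }
            if (token.equals("Do")) {
                // Do paints an XObject (type #1). Its operand is the resource
                // name emitted just before it, so drop that as well.
                // Caveat: Do also paints form XObjects, not only images.
                if (!out.isEmpty() && out.get(out.size() - 1).startsWith("/")) {
                    out.remove(out.size() - 1);
                }
                continue;
            }
            // Everything else (text, paths such as m / l / S) is kept.
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> page = Arrays.asList(
                "q", "/Im0", "Do", "Q",          // XObject image
                "10", "10", "m", "90", "10", "l", "S", // a stroked line
                "BI", "/W", "1", "ID", "x", "EI", // inline image
                "BT", "ET");                      // (empty) text block
        System.out.println(stripImages(page));
    }
}
```

Vector pictures (type #3) are not handled here, because they are built from the same path operators as ordinary lines and cannot be told apart by operator name alone.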


Best regards
Juraj Lonc




From:   Zhnupy Gonzalez <zhnupy@gmail.com>
To:     users@pdfbox.apache.org, 
Date:   08. 11. 2012 18:05
Subject:        Re: remove all images from pdf



Juraj,
your answer made me pay more attention to the decoders than to the streams.
I learned that there is a class, PageDrawer (actually a subclass of
PDFStreamEngine), that ultimately produces an image from a pdf, so I tried
commenting out a few decoders in PageDrawer.properties until I succeeded:
only when I removed org.apache.pdfbox.util.operator.pagedrawer.LineTo were
the icons I wanted to skip gone.

thanks!


On Thu, Nov 8, 2012 at 8:00 AM, <jlonc@gi-bon.sk> wrote:

> hi,
> there are several types of pictures.
>
> 1. bitmap images that are stored as resources
> 2. inline bitmap images that are stored within page's content stream
> 3. images that consist of curves/vectors - these vectors are defined
> within page's content stream
>
> your example code removes only images defined in #1
> if you want to remove images #2 and #3 it is much harder. you need to
> parse content stream, remove them, and create new content stream.
>
>
> Best regards
> Juraj Lonc
>
>
>
>
> From:   Zhnupy Gonzalez <zhnupy@gmail.com>
> To:     users@pdfbox.apache.org,
> Date:   08. 11. 2012 14:50
> Subject:        remove all images from pdf
>
>
>
> Hello everyone,
> While looking for a way to remove all images from a pdf file, I found this:
>
> http://stackoverflow.com/questions/6831194/how-can-i-remove-all-images-drawings-from-a-pdf-file-and-leave-text-only-in-java
>
> which wasn't enough, so I ended up replacing each page's resource object
> with a new (empty) one:
> for (Object pageObj : catalog.getAllPages()) {
>     PDPage page = (PDPage) pageObj;
>     page.setResources(new PDResources());
> }
>
> which for my purposes is fine (there are some warnings when opening the
> file with acrobat reader but it doesn't interfere with my goal).
>
> BUT, there are still some images in the document and I don't know how to
> tear them out. I don't even know how to "navigate" to those images; my
> guess is I need to somehow traverse page.getCOSDictionary() and
> delete the appropriate entries, but I haven't managed to do that and I'm
> also not sure it would work.
>
> any help?
> regards
>
>

