pdfbox-users mailing list archives

From Zhnupy Gonzalez <zhn...@gmail.com>
Subject Re: remove all images from pdf
Date Thu, 08 Nov 2012 18:03:18 GMT
exactly, my goal was to remove everything from the PDF except a grid and
generate an image of that afterwards.
Oddly (and luckily!) the grid remained there even though LineTo was removed.

regards






On Thu, Nov 8, 2012 at 11:34 AM, <jlonc@gi-bon.sk> wrote:

> good to hear.
>
> I thought you were trying to clone the PDF without images; that's why I
> suggested creating a new content stream.
> If you just wanted to convert the PDF to an image (with the help of
> PageDrawer), then your technique is correct.
>
> Btw, you skipped the "LineTo" operator, which means you skipped all "vector
> pictures" and all other lines in the PDF. So you are probably removing even
> lines you don't want to remove (lines are used to paint table/cell
> borders).
>
>
> Best regards
> Juraj Lonc
>
>
>
>
> From:   Zhnupy Gonzalez <zhnupy@gmail.com>
> To:     users@pdfbox.apache.org,
> Date:   08. 11. 2012 18:05
> Subject:        Re: remove all images from pdf
>
>
>
> Juraj,
> your answer made me pay more attention to decoders rather than streams.
> So I learned there is this class PageDrawer (actually a subclass of
> PDFStreamEngine) that ultimately produces an image from a PDF, so I tried
> commenting out a few decoders in PageDrawer.properties until I succeeded:
> only when I removed org.apache.pdfbox.util.operator.pagedrawer.LineTo were
> the icons I was looking to skip gone.
>
> thanks!
>
>
> On Thu, Nov 8, 2012 at 8:00 AM, <jlonc@gi-bon.sk> wrote:
>
> > hi,
> > there are several types of pictures.
> >
> > 1. bitmap images that are stored as resources
> > 2. inline bitmap images that are stored within page's content stream
> > 3. images that consist of curves/vectors - these vectors are defined
> > within page's content stream
> >
> > Your example code removes only images of type #1.
> > Removing images of types #2 and #3 is much harder: you need to
> > parse the content stream, remove them, and create a new content stream.
> >
> >
> > Best regards
> > Juraj Lonc
> >
> >
> >
> >
> > From:   Zhnupy Gonzalez <zhnupy@gmail.com>
> > To:     users@pdfbox.apache.org,
> > Date:   08. 11. 2012 14:50
> > Subject:        remove all images from pdf
> >
> >
> >
> > Hello everyone,
> > While looking for a way to remove all images from a PDF file, I found this:
> >
> > http://stackoverflow.com/questions/6831194/how-can-i-remove-all-images-drawings-from-a-pdf-file-and-leave-text-only-in-java
> >
> > which wasn't enough, so I ended up replacing the page's resource object
> > with a new (empty) one:
> > for (Object pageObj : catalog.getAllPages()) {
> >     PDPage page = (PDPage) pageObj;
> >     page.setResources(new PDResources());
> > }
> >
> > which for my purposes is fine (there are some warnings when opening the
> > file with Acrobat Reader, but they don't interfere with my goal).
> >
> > BUT, there are still some images in the document and I don't know how to
> > tear them out. I don't even know how to "navigate" to those images. My
> > guess is I need to somehow traverse page.getCOSDictionary() and
> > delete the appropriate entries, but I haven't managed to do that and I'm
> > also not sure if that works.
> >
> > any help?
> > regards
> >
> >
>
>
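
The "parse content stream, remove them, and create new content stream" approach suggested in the thread can be sketched roughly as follows. This is a simplified, standard-library-only illustration, not PDFBox code: it assumes the content stream has already been tokenized into plain strings (a real parser would hand back operand and operator objects instead), and the names ContentStreamFilter and stripImages are hypothetical. It drops inline images (everything from BI through EI) and XObject paint calls (a name operand followed by Do).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: filters an already-tokenized content stream.
class ContentStreamFilter {

    // Returns a copy of the tokens without inline images (BI ... EI)
    // and without "/Name Do" XObject invocations.
    static List<String> stripImages(List<String> tokens) {
        List<String> kept = new ArrayList<String>();
        boolean insideInlineImage = false;
        for (String token : tokens) {
            if (insideInlineImage) {
                // Skip the image dictionary and data up to and including EI.
                if (token.equals("EI")) {
                    insideInlineImage = false;
                }
                continue;
            }
            if (token.equals("BI")) {
                insideInlineImage = true;
                continue;
            }
            if (token.equals("Do")) {
                // Drop the operator and the name operand that precedes it.
                int last = kept.size() - 1;
                if (last >= 0 && kept.get(last).startsWith("/")) {
                    kept.remove(last);
                }
                continue;
            }
            kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed sample: one XObject image, one inline image, one stroked line.
        List<String> tokens = Arrays.asList(
                "q", "100", "0", "0", "100", "50", "50", "cm", "/Im1", "Do", "Q",
                "BI", "/W", "1", "/H", "1", "ID", "<data>", "EI",
                "0", "0", "m", "10", "10", "l", "S");
        System.out.println(stripImages(tokens));
    }
}
```

Note that Do also paints form XObjects, so a filter like this removes more than bitmaps. Vector drawings (type #3 in the list above) are built from ordinary path operators and cannot be told apart from table borders this way.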

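Commenting out decoders in PageDrawer.properties, as described in the thread, amounts to removing entries from an operator-to-handler map before rendering. Below is a minimal sketch using java.util.Properties. It assumes the file maps PDF operator names to handler class names; the three sample entries are illustrative guesses at that format rather than a copy of the real file, and OperatorMapFilter and withoutHandlers are hypothetical names. How the filtered map is handed to the renderer depends on the PDFBox version and is not shown.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;

// Hypothetical helper: drops operator mappings by handler class name.
class OperatorMapFilter {

    // Parses properties text and returns only the entries whose handler
    // class (simple name, without the package) is not in the excluded set.
    static Properties withoutHandlers(String propertiesText, Set<String> excluded) {
        Properties all = new Properties();
        try {
            all.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new IllegalArgumentException("Could not read properties text", e);
        }
        Properties kept = new Properties();
        for (String operator : all.stringPropertyNames()) {
            String handlerClass = all.getProperty(operator);
            String simpleName = handlerClass.substring(handlerClass.lastIndexOf('.') + 1);
            if (!excluded.contains(simpleName)) {
                kept.setProperty(operator, handlerClass);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Assumed sample entries; only LineTo is named in the thread.
        String sample =
                "m=org.apache.pdfbox.util.operator.pagedrawer.MoveTo\n"
              + "l=org.apache.pdfbox.util.operator.pagedrawer.LineTo\n"
              + "re=org.apache.pdfbox.util.operator.pagedrawer.AppendRectangleToPath\n";
        Set<String> excluded = new HashSet<String>(Arrays.asList("LineTo"));
        Properties kept = withoutHandlers(sample, excluded);
        System.out.println(kept.stringPropertyNames());
    }
}
```

If the grid happens to be drawn with rectangle operators (re) rather than line segments (l), that might explain why it survived the removal of LineTo, though nothing in the thread confirms this.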