pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Overlay 2 files partially
Date Tue, 17 May 2016 07:37:42 GMT
On 17.05.2016 at 04:20, Romain Guillaume wrote:
> Hi everyone,
> I would like to overlay 2 pdf files but with particular modifications.
> I know how to overlay 2 pdf but sometimes I need to remove some elements of
> one of them during overlay operation.
> For example, imagine an invoice composed with 2 files:
> -one is the background page (containing logo or others fixed graphical
> elements)
> -one is the text page (containing amounts, dates, invoice number, ...)
> As you probably guessed, to obtain final invoice I overlay this 2 pages
> (and it works perfectly in 99.99% of cases)
> That brings me to my problem. Sometimes the "text page" is somewhat
> "dirty". I mean there are some text areas with a white background instead
> of a transparent background. So when I do the overlay, I see on final
> invoice, white areas which overwrite background page (it covers some
> graphical elements and it should not).
> So my question is how to say during overlay operation: "don't keep elements
> which are white backgrounds, or replace them by transparent backgrounds". I
> don't know how parse each element of the pdf and say if this element is a
> white background don't keep it.
> My question is not "how to keep text only". I want remove only white
> backgrounds (or replace them by transparent backgrounds) and keep all
> others elements (all images, all texts, all backgrounds which are not
> white, ...)
> I use pdfbox 1.8.11
> I thank you in advance for your help.
Tricky, even if you can share the file.

You should look at the file with the PDFDebugger app (2.x is better).
Then find path operators like m, l and re in the content stream of a
page. Then find the color assignment for the non-stroking color (could
be cs, sc, scn, k, g, rg) and either remove the fill or insert a
transparent graphics state parameter and restore it later. Get the token
list, change it, and rewrite it into a new content stream.
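
A minimal sketch of the "get the token list, change it, rewrite it" idea, in plain Java. It is a simplification, not PDFBox code: tokens are plain strings here, and the class and method names (WhiteFillFilter, suppressWhiteFills) are made up for illustration. In PDFBox 2.x the token list would come from PDFStreamParser and be written back with ContentStreamWriter. The sketch tracks the non-stroking color across q/Q and, when a path is filled while that color is white, replaces the fill operator with n (end path without painting), or with a stroke-only operator for the fill-and-stroke variants. It assumes white is only ever set with g, rg or k, and treats cs/sc/scn as "not known to be white".

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Simplified sketch: tokens are plain strings, not PDFBox COS objects.
final class WhiteFillFilter {

    // Returns a copy of the token list in which white fills are not painted.
    static List<String> suppressWhiteFills(List<String> tokens) {
        List<String> out = new ArrayList<>(tokens.size());
        Deque<Boolean> saved = new ArrayDeque<>(); // "fill is white" flag per q/Q level
        boolean white = false;                     // the initial fill color is black
        for (int i = 0; i < tokens.size(); i++) {
            String t = tokens.get(i);
            String emit = t;
            switch (t) {
                case "q":  saved.push(white); break;
                case "Q":  if (!saved.isEmpty()) white = saved.pop(); break;
                case "g":  white = operandsAre(tokens, i, 1, 1.0); break; // 1 g
                case "rg": white = operandsAre(tokens, i, 3, 1.0); break; // 1 1 1 rg
                case "k":  white = operandsAre(tokens, i, 4, 0.0); break; // 0 0 0 0 k
                case "cs": case "sc": case "scn":
                    white = false; break; // depends on the color space: be conservative
                case "f": case "F": case "f*":
                    if (white) emit = "n"; break; // end the path without painting
                case "B": case "B*":
                    if (white) emit = "S"; break; // keep the stroke, drop the fill
                case "b": case "b*":
                    if (white) emit = "s"; break; // close and stroke, drop the fill
                default: break;
            }
            out.add(emit);
        }
        return out;
    }

    // True if the 'count' tokens before opIndex are all numbers equal to 'expected'.
    private static boolean operandsAre(List<String> tokens, int opIndex,
                                       int count, double expected) {
        if (opIndex < count) return false;
        for (int j = opIndex - count; j < opIndex; j++) {
            try {
                if (Double.parseDouble(tokens.get(j)) != expected) return false;
            } catch (NumberFormatException e) {
                return false; // operand is not a number
            }
        }
        return true;
    }
}
```

For example, feeding it the tokens of "q 1 1 1 rg 10 10 100 50 re f Q" turns the f into n, while a later "0 g ... re f" is left alone. Replacing the operator rather than deleting the path tokens keeps the rest of the stream intact.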

It might be even trickier if a PDF uses form XObjects (elements with
their own content stream). Or if the colors are defined in a way that
makes it hard to tell what is white.
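
To illustrate why "white" is hard to pin down: the same white can be written as 1 g, 1 1 1 rg or 0 0 0 0 k, and a producer may emit something like 0.999 instead of 1. A rough sketch of a tolerance-based check follows. The class name, method name and the tolerance parameter are all hypothetical, and it only covers the device color operators; for sc/scn the answer depends on the current color space, so it simply says "not white".

```java
// Hypothetical helper: decides whether a fill color set by a device color
// operator is close enough to white to be treated as a white background.
final class NearWhite {

    static boolean isNearWhite(String operator, double[] c, double tol) {
        switch (operator) {
            case "g":   // DeviceGray: 1 is white
                return c.length == 1 && c[0] >= 1 - tol;
            case "rg":  // DeviceRGB: 1 1 1 is white
                return c.length == 3
                        && c[0] >= 1 - tol && c[1] >= 1 - tol && c[2] >= 1 - tol;
            case "k":   // DeviceCMYK: 0 0 0 0 is white (no ink)
                return c.length == 4
                        && c[0] <= tol && c[1] <= tol && c[2] <= tol && c[3] <= tol;
            default:    // sc, scn, ...: unknown without the color space
                return false;
        }
    }
}
```

Picking the tolerance is itself a judgment call: too tight and near-white boxes survive, too loose and light gray design elements disappear.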

However, paths are also used to draw lines, boxes, etc. that you don't
want to remove.

And what about invoices that come with a background image? Or a logo? Or
an image that is actually a bunch of vector graphics?

This is a terrible assignment, maybe the result of a poor business decision.

