pdfbox-users mailing list archives

From a7med shre3y <a7med.shr...@gmail.com>
Subject Re: Text removal
Date Tue, 24 Mar 2015 11:49:51 GMT
The question is how the text can still show up in the output file. I assume
the text must be stored somewhere else in the PDF, but I don't know whether
that assumption is correct. Do you have an explanation for that?
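
The explanation given in the quoted replies below (the searchable text uses text rendering mode 3, while the visible "text" is drawn with path operators such as m and l) can be checked with a small scan of the page's content stream. The following is only a minimal sketch under stated assumptions: it works on a content stream that has already been decoded to a plain string, it does not use PDFBox, and the names ContentStreamScan, Report, tokenize and scan are hypothetical. It also ignores q/Q state saving and nested or escaped parentheses.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch (hypothetical helper): classify operators in a simplified content stream. */
final class ContentStreamScan {

    /** Counters for one content stream. */
    static final class Report {
        int visibleTextOps;    // text-showing operators with a visible rendering mode
        int invisibleTextOps;  // text-showing operators while rendering mode 3 is active
        int pathOps;           // m, l, c, re: path construction (vector drawing)
    }

    /** Splits the stream into tokens; "(...)" and "[...]" each become one token. */
    static List<String> tokenize(String content) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < content.length()) {
            char ch = content.charAt(i);
            if (Character.isWhitespace(ch)) {
                i++;
                continue;
            }
            int end;
            if (ch == '(') {
                end = content.indexOf(')', i);   // simplification: no nesting or escapes
            } else if (ch == '[') {
                end = content.indexOf(']', i);
            } else {
                end = i;
                while (end + 1 < content.length()
                        && !Character.isWhitespace(content.charAt(end + 1))
                        && "([".indexOf(content.charAt(end + 1)) < 0) {
                    end++;
                }
            }
            if (end < 0) {
                end = content.length() - 1;      // unterminated string or array
            }
            tokens.add(content.substring(i, end + 1));
            i = end + 1;
        }
        return tokens;
    }

    /** Walks the tokens and counts text-showing and path-construction operators. */
    static Report scan(String content) {
        Report report = new Report();
        List<String> tokens = tokenize(content);
        int renderMode = 0;                      // default text rendering mode: fill
        for (int j = 0; j < tokens.size(); j++) {
            switch (tokens.get(j)) {
                case "Tr":                       // operand before Tr is the new mode
                    if (j > 0) {
                        try {
                            renderMode = Integer.parseInt(tokens.get(j - 1));
                        } catch (NumberFormatException e) {
                            // malformed operand: keep the previous mode
                        }
                    }
                    break;
                case "Tj":
                case "TJ":
                case "'":
                case "\"":
                    if (renderMode == 3) {
                        report.invisibleTextOps++;
                    } else {
                        report.visibleTextOps++;
                    }
                    break;
                case "m":
                case "l":
                case "c":
                case "re":
                    report.pathOps++;
                    break;
                default:
                    break;
            }
        }
        return report;
    }
}
```

If a page reports only invisible text operators alongside many path operators, that would be consistent with the situation described in this thread: resetting the COSString removes the searchable layer, but the outlines that are actually painted remain.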

On Tue, Mar 24, 2015 at 10:46 AM, Maruan Sahyoun <sahyoun@fileaffairs.de>
wrote:

>
> > On 24.03.2015 at 10:43, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >
> > I mean, how do I find them in the PDF while iterating over the tokens?
> > What is the operator?
> >
> > On Tue, Mar 24, 2015 at 10:40 AM, Maruan Sahyoun <sahyoun@fileaffairs.de>
> > wrote:
> >
> >>
> >>> On 24.03.2015 at 10:36, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >>>
> >>> What are the drawing commands? I'd then investigate how to pick out
> >>> the ones that draw the text.
> >>>
> >>
> >> 738.7469 167.1278 m
>
> MoveTo
>
> >> 733.8743 167.1278 l
> >>
>
> LineTo
>
>
> >>
> >>
> >>> On Tue, Mar 24, 2015 at 10:26 AM, Maruan Sahyoun <sahyoun@fileaffairs.de>
> >>> wrote:
> >>>
> >>>>
> >>>>> On 24.03.2015 at 10:14, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >>>>>
> >>>>> That's true. I've already tried changing the text rendering mode to
> >>>>> other values, as listed in the PDF 1.5 specification, Table 5.3, before
> >>>>> removing it; that didn't work either.
> >>>>> So how do I remove the graphics content, then?
> >>>>
> >>>> The simple answer: remove the drawing commands.
> >>>>
> >>>> The longer answer: as you obviously don't want to remove all drawing
> >>>> commands, you'd need to find the ones that draw the text. As you
> >>>> would like to remove the vectors that match a certain character/glyph,
> >>>> you first need to find out which ones draw e.g. the letter 'T'.
> >>>> I don't think that this is doable in a reasonable amount of time for
> >>>> arbitrary text.
> >>>>
> >>>> Maruan
> >>>>
> >>>>
> >>>>>
> >>>>> Best Regards,
> >>>>>
> >>>>> On Tue, Mar 24, 2015 at 10:06 AM, Maruan Sahyoun <sahyoun@fileaffairs.de>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>>> On 24.03.2015 at 09:55, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >>>>>>>
> >>>>>>> You can download it from here:
> >>>>>>>
> >>>>>>> https://drive.google.com/file/d/0B5Kxacm1mej-MEZubTNYVVJYTFE/view?usp=sharing
> >>>>>>>
> >>>>>>
> >>>>>> Looking more closely, you correctly replaced the text, but that text
> >>>>>> was in there for searching within the PDF, as it used text rendering
> >>>>>> mode 3 (invisible). The 'text' you are still seeing is drawn using
> >>>>>> vector commands, so it's graphics content.
> >>>>>>
> >>>>>> BR
> >>>>>> Maruan
> >>>>>>
> >>>>>>
> >>>>>>> Best Regards,
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Mar 24, 2015 at 9:48 AM, Maruan Sahyoun <sahyoun@fileaffairs.de>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>>
> >>>>>>>>> On 24.03.2015 at 09:40, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> In fact, PDFBox calls the operation of transforming "7R %H $SSURYHG"
> >>>>>>>>> to "To Be Approved" "encoding". Anyway, whether it's encoding or
> >>>>>>>>> decoding, I thought it was easier to transform "7R %H $SSURYHG" to
> >>>>>>>>> "To Be Approved" and not the opposite (or at least I don't know how).
> >>>>>>>>> I spent quite a long time trying to find the character codes for the
> >>>>>>>>> glyphs in the currently used font, then found that it's not an easy
> >>>>>>>>> task. By the way, if you know how to do that, I'd very much appreciate
> >>>>>>>>> it, because I need it for replacing text with other text, and for that
> >>>>>>>>> the new text must be encoded the same way as the original!
> >>>>>>>>>
> >>>>>>>>> Back to the text removal: I am able to find the text and also remove
> >>>>>>>>> it by calling reset, as I mentioned in my first email. When I print
> >>>>>>>>> the output content I don't find the text anymore, but I still see it
> >>>>>>>>> when I open the file. My first assumption was that there must be some
> >>>>>>>>> other way to remove the text than the one I am using, and that's what
> >>>>>>>>> you've actually confirmed in your reply, so could you please tell me
> >>>>>>>>> what is still missing?
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Could you upload the PDF with the reset text too?
> >>>>>>>>
> >>>>>>>> BR
> >>>>>>>> Maruan
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> Thanks and regards,
> >>>>>>>>> a7mad
> >>>>>>>>>
> >>>>>>>>> On Tue, Mar 24, 2015 at 9:22 AM, Maruan Sahyoun <sahyoun@fileaffairs.de>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>>> On 24.03.2015 at 08:14, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> Here's how I do it:
> >>>>>>>>>>>
> >>>>>>>>>>> 1. I use the following method to encode the text:
> >>>>>>>>>>>
> >>>>>>>>>>> String encode(String text, PDFont font) throws Exception {
> >>>>>>>>>>>   StringBuilder builder = new StringBuilder();
> >>>>>>>>>>>   byte[] stringBytes = text.getBytes();
> >>>>>>>>>>>   int codeLength = 1;
> >>>>>>>>>>>   for (int i = 0; i < stringBytes.length; i += codeLength) {
> >>>>>>>>>>>       String c = font.encode(stringBytes, i, codeLength);
> >>>>>>>>>>>       if (c == null && (i + 1 < stringBytes.length)) {
> >>>>>>>>>>>           codeLength++;
> >>>>>>>>>>>           c = font.encode(stringBytes, i, codeLength);
> >>>>>>>>>>>       }
> >>>>>>>>>>>       builder.append(c);
> >>>>>>>>>>>   }
> >>>>>>>>>>>   return builder.toString();
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> 2. Iterating through the tokens, I find the text, whether it's a
> >>>>>>>>>>> COSString ("Tj" operator) or a COSArray ("TJ" operator), then check
> >>>>>>>>>>> whether it's the text I want to remove, as follows:
> >>>>>>>>>>>
> >>>>>>>>>>> if (op.getOperation().equals("Tj")) {
> >>>>>>>>>>>     COSString previous = (COSString) tokens.get(j - 1);
> >>>>>>>>>>>     String string = previous.getString();
> >>>>>>>>>>>     String encodedString = encode(string, font);
> >>>>>>>>>>
> >>>>>>>>>> that string is already encoded. So you'd need to encode "To Be
> >>>>>>>>>> Approved" and compare whether that matches the string you are
> >>>>>>>>>> reading from the PDF.
> >>>>>>>>>>
> >>>>>>>>>>>     if (encodedString.contains("To Be Approved")) {
> >>>>>>>>>>>         previous.reset();
> >>>>>>>>>>>     }
> >>>>>>>>>>> } else if (op.getOperation().equals("TJ")) {
> >>>>>>>>>>>     COSArray previous = (COSArray) tokens.get(j - 1);
> >>>>>>>>>>>     StringBuilder stringBuilder = new StringBuilder();
> >>>>>>>>>>>     for (int k = 0; k < previous.size(); k++) {
> >>>>>>>>>>>         Object arrElement = previous.getObject(k);
> >>>>>>>>>>>         if (arrElement instanceof COSString) {
> >>>>>>>>>>>             COSString cosString = (COSString) arrElement;
> >>>>>>>>>>>             stringBuilder.append(cosString.getString());
> >>>>>>>>>>>         }
> >>>>>>>>>>>     }
> >>>>>>>>>>>     String string = stringBuilder.toString();
> >>>>>>>>>>>     String encodedString = encode(string, font);
> >>>>>>>>>>>     if (encodedString.contains("To Be Approved")) {
> >>>>>>>>>>>         previous.clear();
> >>>>>>>>>>>     }
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>> Note:
> >>>>>>>>>>> In the case of a COSArray, I first iterate through the whole array to
> >>>>>>>>>>> get the whole string before encoding and comparison, and this works.
> >>>>>>>>>>>
> >>>>>>>>>>> Best Regards,
> >>>>>>>>>>> a7mad
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>> your text is encoded, so within the show text operator Tj the
> >>>>>>>>>>>> string is
> >>>>>>>>>>>>
> >>>>>>>>>>>> 7R %H $SSURYHG
> >>>>>>>>>>>>
> >>>>>>>>>>>> You wrote that you encode your string to find it - what do you get?
> >>>>>>>>>>>>
> >>>>>>>>>>>> BR
> >>>>>>>>>>>> Maruan
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On 23.03.2015 at 22:01, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi Maruan,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Here's a link where you can download the PDF:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Kind Regards,
> >>>>>>>>>>>>> a7mad
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> you need to upload it to a public location, as the mailing list
> >>>>>>>>>>>>>> doesn't support attachments.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> BR
> >>>>>>>>>>>>>> Maruan
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 23.03.2015 at 19:18, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Dear Maruan,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thank you very much for the information. Please find attached the
> >>>>>>>>>>>>>>> PDF to reproduce the problem.
> >>>>>>>>>>>>>>> The text to remove is: "To Be Approved". The text has a multi-byte
> >>>>>>>>>>>>>>> encoding, so I first encode it in order to find it, then remove it.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Best Regards,
> >>>>>>>>>>>>>>> a7mad
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Mon, Mar 23, 2015 at 4:13 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
> >>>>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>>>> Dear a7mad,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> removing text from a PDF is not an easy task, as
> >>>>>>>>>>>>>>>> - text which visually appears as a single item might consist of
> >>>>>>>>>>>>>>>>   individual parts within the PDF itself, e.g. each character or
> >>>>>>>>>>>>>>>>   group of characters is placed individually in different COSStrings
> >>>>>>>>>>>>>>>> - text might be drawn using graphics commands
> >>>>>>>>>>>>>>>> - text can appear within different parts of the PDF (e.g. the text
> >>>>>>>>>>>>>>>>   might be the content of a form field AND of the annotation
> >>>>>>>>>>>>>>>>   representing the form field visually)
> >>>>>>>>>>>>>>>> - you need to look up the encoding information to get from the
> >>>>>>>>>>>>>>>>   characters in the PDF "string" to the ones you are looking for
> >>>>>>>>>>>>>>>> ….
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> If you can post a specific PDF to a public location and describe in
> >>>>>>>>>>>>>>>> detail which string should have been replaced but hasn't been, I
> >>>>>>>>>>>>>>>> will be able to tell you why that might have happened.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Maruan
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> On 23.03.2015 at 15:03, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Currently I am facing a strange problem removing text from some PDFs.
> >>>>>>>>>>>>>>>>> My program is able to find the text and "remove it" by calling the
> >>>>>>>>>>>>>>>>> COSString.reset() method.
> >>>>>>>>>>>>>>>>> The problem is that when I open the output PDF file, I still see the
> >>>>>>>>>>>>>>>>> text, but it is not selectable (I mean when I try to highlight it with
> >>>>>>>>>>>>>>>>> the mouse to copy it, it's not selectable!). When I print the content
> >>>>>>>>>>>>>>>>> (tokens) of the output file, I DO NOT find the text at all!!
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I am currently stuck in the PDF specification 1.5 and really running
> >>>>>>>>>>>>>>>>> out of time.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> I'd very much appreciate any help or any idea on what's going on.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Notes:
> >>>>>>>>>>>>>>>>> 1. I use PDFBox 1.7.1
> >>>>>>>>>>>>>>>>> 2. This problem does not occur with all PDFs; only some PDFs cause
> >>>>>>>>>>>>>>>>> this problem.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Thank you very much.
> >>>>>>>>>>>>>>>>> a7mad
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> ---------------------------------------------------------------------
> >>>>>>>>>>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> >>>>>>>>>>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
> >>>>>>>>>>>>>>>

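Maruan's remark in the quoted thread, that the Tj operand is already encoded and that the search term "To Be Approved" should be encoded before comparing, can be illustrated with the sample strings from this thread. Each character of "7R %H $SSURYHG" happens to sit 29 code points below the matching character of "To Be Approved" ('T' is 84 and '7' is 55, for example). The sketch below assumes that fixed offset purely as an observation about this one sample; it is not a PDF rule, and a real font needs its own encoding or ToUnicode information. The names EncodedSearch, SAMPLE_OFFSET, encode, decode and operandContains are hypothetical.

```java
/** Sketch (hypothetical helper): search for text by encoding the search term first. */
final class EncodedSearch {

    /** Offset observed in this thread's sample only (assumption, not a PDF rule). */
    static final int SAMPLE_OFFSET = 29;

    /** Maps readable text to the codes as they would appear in the Tj operand. */
    static String encode(String text, int offset) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch < offset) {
                throw new IllegalArgumentException("character below offset: " + (int) ch);
            }
            sb.append((char) (ch - offset));
        }
        return sb.toString();
    }

    /** Reverse direction: codes read from the content stream back to readable text. */
    static String decode(String codes, int offset) {
        StringBuilder sb = new StringBuilder(codes.length());
        for (int i = 0; i < codes.length(); i++) {
            sb.append((char) (codes.charAt(i) + offset));
        }
        return sb.toString();
    }

    /** Encodes the search term once, then looks for it in the raw operand. */
    static boolean operandContains(String rawOperand, String searchText, int offset) {
        return rawOperand.contains(encode(searchText, offset));
    }
}
```

Under this assumption a space (code 32) would map to code 3, which the archive may simply have rendered as a blank, so the sketch is best tried word by word, for example `EncodedSearch.operandContains("7R %H $SSURYHG", "Approved", EncodedSearch.SAMPLE_OFFSET)`. The point is the direction of the comparison: encode the search term once, rather than running the encode step over the string that was read from the PDF.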