pdfbox-users mailing list archives

From a7med shre3y <a7med.shr...@gmail.com>
Subject Re: Text removal
Date Tue, 24 Mar 2015 09:43:05 GMT
I mean: how do I find them in the PDF while iterating over the tokens? What is
the operator?
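
In content-stream terms, the `m` and `l` in the two lines quoted below are path construction operators (move to, line to); a path is then painted by a separate operator such as `S` or `f`. The following is a minimal sketch of picking such operators out of a token sequence, not PDFBox code. It assumes the operator names are available as plain strings, as with the `op.getOperation()` calls quoted later in this thread. The class and method names are hypothetical, and splitting on whitespace is a simplification that would break on string operands containing spaces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of PDFBox.
class ContentStreamOperators {
    // Path construction operators defined by the PDF specification.
    static final Set<String> PATH_CONSTRUCTION =
            new HashSet<>(Arrays.asList("m", "l", "c", "v", "y", "h", "re"));
    // Path painting operators defined by the PDF specification.
    static final Set<String> PATH_PAINTING =
            new HashSet<>(Arrays.asList("S", "s", "f", "F", "f*", "B", "B*", "b", "b*", "n"));
    // Text showing operators, listed for contrast.
    static final Set<String> TEXT_SHOWING =
            new HashSet<>(Arrays.asList("Tj", "TJ", "'", "\""));

    // Returns a coarse category for one token.
    static String classify(String token) {
        if (PATH_CONSTRUCTION.contains(token)) return "path construction";
        if (PATH_PAINTING.contains(token)) return "path painting";
        if (TEXT_SHOWING.contains(token)) return "text showing";
        return "other"; // operands and all remaining operators
    }

    // Collects the path-related operators in the order they appear.
    // Simplification: tokens are split on whitespace only.
    static List<String> pathOperators(String content) {
        List<String> found = new ArrayList<>();
        for (String token : content.trim().split("\\s+")) {
            if (classify(token).startsWith("path")) {
                found.add(token);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // The two drawing commands quoted in this thread.
        String snippet = "738.7469 167.1278 m\n733.8743 167.1278 l";
        System.out.println(pathOperators(snippet)); // prints [m, l]
    }
}
```

Note that these operators draw every vector shape on the page, not only glyph outlines, so finding them is only the first step; deciding which paths form a given letter is the hard part described in the quoted reply.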

On Tue, Mar 24, 2015 at 10:40 AM, Maruan Sahyoun <sahyoun@fileaffairs.de>
wrote:

>
> > On 24.03.2015 at 10:36, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >
> > What are the drawing commands? I'd then investigate how to identify the
> > ones that draw text.
> >
>
> 738.7469 167.1278 m
> 733.8743 167.1278 l
>
>
>
> > On Tue, Mar 24, 2015 at 10:26 AM, Maruan Sahyoun <sahyoun@fileaffairs.de>
> > wrote:
> >
> >>
> >>> On 24.03.2015 at 10:14, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >>>
> >>> That's true. I had already tried changing the text rendering mode to other
> >>> values, as listed in Table 5.3 of the PDF 1.5 specification, before
> >>> removing the text, and that didn't work either.
> >>> So how do I remove the graphics content, then?
> >>
> >> The simple answer: remove the drawing commands.
> >>
> >> The longer answer: as you obviously don't want to remove all drawing
> >> commands, you'd need to find the ones that draw the text. Since you want
> >> to remove the vectors that make up a certain character/glyph, you first
> >> need to find out which ones draw, e.g., the letter 'T'. I don't think this
> >> is doable in a reasonable amount of time for arbitrary text.
> >>
> >> Maruan
> >>
> >>
> >>>
> >>> Best Regards,
> >>>
> >>> On Tue, Mar 24, 2015 at 10:06 AM, Maruan Sahyoun <sahyoun@fileaffairs.de>
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>>> On 24.03.2015 at 09:55, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >>>>>
> >>>>> You can download it from here:
> >>>>>
> >>>>> https://drive.google.com/file/d/0B5Kxacm1mej-MEZubTNYVVJYTFE/view?usp=sharing
> >>>>>
> >>>>
> >>>> Looking more closely, you correctly replaced the text, but that text was
> >>>> only there for searching within the PDF, as it used text rendering mode 3
> >>>> (invisible). The 'text' you are still seeing is drawn using vector
> >>>> commands, so it's graphics content.
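
A minimal sketch of how invisible text of this kind could be spotted while walking a token list: track the current text rendering mode (set by the `Tr` operator) and flag text-showing operators executed while the mode is 3. The sketch works on tokens that are already split into plain strings; the class name, method name, and the sample token sequence are assumptions for illustration, since the thread only states that the text used rendering mode 3.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical helper for illustration; not part of PDFBox.
class InvisibleTextScan {
    // Returns the indexes of text-showing operators (Tj, TJ, ', ")
    // that run while the text rendering mode is 3 (invisible).
    static List<Integer> invisibleShowIndexes(List<String> tokens) {
        List<Integer> hits = new ArrayList<>();
        Deque<Integer> saved = new ArrayDeque<>(); // modes saved by q
        int mode = 0; // default rendering mode: fill
        for (int i = 0; i < tokens.size(); i++) {
            switch (tokens.get(i)) {
                case "q": // save graphics state, which includes the mode
                    saved.push(mode);
                    break;
                case "Q": // restore graphics state
                    if (!saved.isEmpty()) mode = saved.pop();
                    break;
                case "Tr": // operand is the token just before the operator
                    if (i > 0) mode = Integer.parseInt(tokens.get(i - 1));
                    break;
                case "Tj":
                case "TJ":
                case "'":
                case "\"":
                    if (mode == 3) hits.add(i);
                    break;
                default:
                    break;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Assumed token sequence built around the string from this thread.
        List<String> tokens = Arrays.asList(
                "BT", "3", "Tr", "(7R %H $SSURYHG)", "Tj", "ET");
        System.out.println(invisibleShowIndexes(tokens)); // prints [4]
    }
}
```

If the only matches for a search string are flagged this way, the visible glyphs must come from somewhere else, such as the vector commands mentioned above or an image.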
> >>>>
> >>>> BR
> >>>> Maruan
> >>>>
> >>>>
> >>>>> Best Regards,
> >>>>>
> >>>>>
> >>>>> On Tue, Mar 24, 2015 at 9:48 AM, Maruan Sahyoun <sahyoun@fileaffairs.de>
> >>>>> wrote:
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>>> On 24.03.2015 at 09:40, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> In fact, PDFBox calls the operation of transforming "7R %H $SSURYHG" to
> >>>>>>> "To Be Approved" "encoding". Anyway, whether it's encoding or decoding, I
> >>>>>>> thought it was easier to transform "7R %H $SSURYHG" to "To Be Approved"
> >>>>>>> than the opposite (or at least I don't know how to do the opposite). I
> >>>>>>> spent quite a long time trying to find the character codes for the glyphs
> >>>>>>> in the currently used font, and found that it's not an easy task. By the
> >>>>>>> way, if you know how to do that, I'd very much appreciate it, because I
> >>>>>>> need it for replacing text with other text, and for that the new text
> >>>>>>> must be encoded the same way as the original!
> >>>>>>>
> >>>>>>> Back to the text removal: I am able to find the text and also remove it
> >>>>>>> by calling reset. As I mentioned in my first email, when I print the
> >>>>>>> output content I don't find the text anymore, but I still see it when I
> >>>>>>> open the file. My first assumption was that there must be some other way
> >>>>>>> to remove the text than the one I am using, and that's what you've
> >>>>>>> actually confirmed in your reply, so could you please tell me what is
> >>>>>>> still missing?
> >>>>>>>
> >>>>>>
> >>>>>> Could you upload the PDF with the reset text too?
> >>>>>>
> >>>>>> BR
> >>>>>> Maruan
> >>>>>>
> >>>>>>
> >>>>>>> Thanks and regards,
> >>>>>>> a7mad
> >>>>>>>
> >>>>>>> On Tue, Mar 24, 2015 at 9:22 AM, Maruan Sahyoun <sahyoun@fileaffairs.de>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>>> On 24.03.2015 at 08:14, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> Here's how I do it:
> >>>>>>>>>
> >>>>>>>>> 1. I use the following method to encode the text:
> >>>>>>>>>
> >>>>>>>>> String encode(String text, PDFont font) throws Exception {
> >>>>>>>>>     StringBuilder builder = new StringBuilder();
> >>>>>>>>>     byte[] stringBytes = text.getBytes();
> >>>>>>>>>     int codeLength = 1;
> >>>>>>>>>     for (int i = 0; i < stringBytes.length; i += codeLength) {
> >>>>>>>>>         String c = font.encode(stringBytes, i, codeLength);
> >>>>>>>>>         if (c == null && (i + 1 < stringBytes.length)) {
> >>>>>>>>>             codeLength++;
> >>>>>>>>>             c = font.encode(stringBytes, i, codeLength);
> >>>>>>>>>         }
> >>>>>>>>>         builder.append(c);
> >>>>>>>>>     }
> >>>>>>>>>     return builder.toString();
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> 2. Iterating through the tokens, I find the text, whether it's a COSString
> >>>>>>>>> ("Tj" operator) or a COSArray ("TJ" operator), then check whether it's the
> >>>>>>>>> text I want to remove, as follows:
> >>>>>>>>>
> >>>>>>>>> if (op.getOperation().equals("Tj")) {
> >>>>>>>>>     COSString previous = (COSString) tokens.get(j - 1);
> >>>>>>>>>     String string = previous.getString();
> >>>>>>>>>     String encodedString = encode(string, font);
> >>>>>>>>
> >>>>>>>> That string is already encoded. So you'd need to encode "To Be Approved"
> >>>>>>>> and compare whether that matches the string you are reading from the PDF.
> >>>>>>>>
> >>>>>>>>>     if (encodedString.contains("To Be Approved")) {
> >>>>>>>>>         previous.reset();
> >>>>>>>>>     }
> >>>>>>>>> } else if (op.getOperation().equals("TJ")) {
> >>>>>>>>>     COSArray previous = (COSArray) tokens.get(j - 1);
> >>>>>>>>>     StringBuilder stringBuilder = new StringBuilder();
> >>>>>>>>>     for (int k = 0; k < previous.size(); k++) {
> >>>>>>>>>         Object arrElement = previous.getObject(k);
> >>>>>>>>>         if (arrElement instanceof COSString) {
> >>>>>>>>>             COSString cosString = (COSString) arrElement;
> >>>>>>>>>             stringBuilder.append(cosString.getString());
> >>>>>>>>>         }
> >>>>>>>>>     }
> >>>>>>>>>     String string = stringBuilder.toString();
> >>>>>>>>>     String encodedString = encode(string, font);
> >>>>>>>>>     if (encodedString.contains("To Be Approved")) {
> >>>>>>>>>         previous.clear();
> >>>>>>>>>     }
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>> Note:
> >>>>>>>>> In the case of a COSArray, I first iterate through the whole array to get
> >>>>>>>>> the whole string before encoding and comparison, and this works.
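
The interleaved advice above (encode "To Be Approved" and compare it with the raw string from the PDF) could be sketched as follows. This is not the PDFBox API: it assumes a Unicode-to-code lookup table is available, which for a real font would have to come from its encoding or ToUnicode data. Here the table is built, purely for illustration, by pairing the two strings quoted in this thread character by character, and it treats every code as a single character even though the font in question is described as multi-byte. All names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of PDFBox.
class EncodedTextMatch {
    // Builds a Unicode -> code table from a raw string and its known
    // decoded form. Assumption: one code per character, same length.
    static Map<Character, Character> reverseTable(String raw, String decoded) {
        if (raw.length() != decoded.length()) {
            throw new IllegalArgumentException("lengths differ");
        }
        Map<Character, Character> toCode = new HashMap<>();
        for (int i = 0; i < raw.length(); i++) {
            toCode.put(decoded.charAt(i), raw.charAt(i));
        }
        return toCode;
    }

    // Encodes readable text into the font's codes, or returns null
    // when the table has no code for one of the characters.
    static String encode(String text, Map<Character, Character> toCode) {
        StringBuilder encoded = new StringBuilder();
        for (char ch : text.toCharArray()) {
            Character code = toCode.get(ch);
            if (code == null) {
                return null;
            }
            encoded.append(code);
        }
        return encoded.toString();
    }

    public static void main(String[] args) {
        Map<Character, Character> table =
                reverseTable("7R %H $SSURYHG", "To Be Approved");
        String needle = encode("To Be Approved", table);
        String rawFromPdf = "7R %H $SSURYHG"; // string read from the Tj operand
        System.out.println(rawFromPdf.contains(needle)); // prints true
    }
}
```

The point of the design is the direction of the comparison: the search text is converted into the font's codes once, and the strings in the content stream are compared as they are stored, without decoding each of them.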
> >>>>>>>>>
> >>>>>>>>> Best Regards,
> >>>>>>>>> a7mad
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> Your text is encoded, so within the show text operator Tj the string is
> >>>>>>>>>>
> >>>>>>>>>> 7R %H $SSURYHG
> >>>>>>>>>>
> >>>>>>>>>> You wrote that you encode your string to find it - what do you get?
> >>>>>>>>>>
> >>>>>>>>>> BR
> >>>>>>>>>> Maruan
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> On 23.03.2015 at 22:01, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Maruan,
> >>>>>>>>>>>
> >>>>>>>>>>> Here's a link where you can download the PDF:
> >>>>>>>>>>>
> >>>>>>>>>>> https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
> >>>>>>>>>>>
> >>>>>>>>>>> Kind Regards,
> >>>>>>>>>>> a7mad
> >>>>>>>>>>>
> >>>>>>>>>>> On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi,
> >>>>>>>>>>>>
> >>>>>>>>>>>> You need to upload it to a public location, as the mailing list doesn't
> >>>>>>>>>>>> support attachments.
> >>>>>>>>>>>>
> >>>>>>>>>>>> BR
> >>>>>>>>>>>> Maruan
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On 23.03.2015 at 19:18, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Dear Maruan,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thank you very much for the information. Please find attached the PDF
> >>>>>>>>>>>>> to reproduce the problem.
> >>>>>>>>>>>>> The text to remove is: "To Be Approved". The text has a multi-byte
> >>>>>>>>>>>>> encoding, so I first encode it in order to find it, then remove it.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Best Regards,
> >>>>>>>>>>>>> a7mad
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Mon, Mar 23, 2015 at 4:13 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
> >>>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>> Dear a7mad,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> removing text from a PDF is not an easy task, as
> >>>>>>>>>>>>>> - text which might visually appear as a single item might consist of
> >>>>>>>>>>>>>>   individual parts within the PDF itself, e.g. each character or groups
> >>>>>>>>>>>>>>   of characters are placed individually in different COSStrings
> >>>>>>>>>>>>>> - text might be drawn using graphics commands
> >>>>>>>>>>>>>> - text can appear within different parts of the PDF (e.g. the text
> >>>>>>>>>>>>>>   might be the content of a form field AND the annotation representing
> >>>>>>>>>>>>>>   the form field visually)
> >>>>>>>>>>>>>> - you need to look up the encoding information to get from the
> >>>>>>>>>>>>>>   characters in the PDF "string" to the ones you are looking for
> >>>>>>>>>>>>>> ….
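
The first point in the list above (one visible phrase stored as several separate strings) can be handled by searching the concatenation of the fragments while remembering where each fragment starts. The following is a minimal sketch under the assumption that the fragments have already been extracted in drawing order, for example from the COSString elements of a TJ array as in the code earlier in this thread. The class and method names are hypothetical, the sample fragments are made up, and only the first occurrence of the search text is reported.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of PDFBox.
class FragmentSearch {
    // Returns the indexes of the fragments that overlap the first
    // occurrence of needle in the concatenated fragments.
    static List<Integer> fragmentsContaining(List<String> fragments, String needle) {
        StringBuilder joined = new StringBuilder();
        int[] starts = new int[fragments.size()];
        for (int i = 0; i < fragments.size(); i++) {
            starts[i] = joined.length(); // offset of fragment i in the joined text
            joined.append(fragments.get(i));
        }
        List<Integer> touched = new ArrayList<>();
        int at = needle.isEmpty() ? -1 : joined.indexOf(needle);
        if (at < 0) {
            return touched; // no match
        }
        int end = at + needle.length(); // exclusive end of the match
        for (int i = 0; i < fragments.size(); i++) {
            int fragStart = starts[i];
            int fragEnd = fragStart + fragments.get(i).length();
            if (fragStart < end && fragEnd > at) {
                touched.add(i); // fragment overlaps the matched range
            }
        }
        return touched;
    }

    public static void main(String[] args) {
        // Made-up split of the phrase into separate strings.
        List<String> fragments = Arrays.asList("To ", "Be", " Appr", "oved", " today");
        System.out.println(fragmentsContaining(fragments, "To Be Approved"));
        // prints [0, 1, 2, 3]
    }
}
```

A match may begin or end in the middle of a fragment, so a caller that removes text would need to trim the first and last fragments rather than clear them outright.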
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> If you can post a specific PDF to a public location and describe in
> >>>>>>>>>>>>>> detail which string should have been replaced but wasn't, I will be
> >>>>>>>>>>>>>> able to tell you why that might have happened.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Maruan
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On 23.03.2015 at 15:03, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Currently I am facing a strange problem removing text from some PDFs.
> >>>>>>>>>>>>>>> My program is able to find the text and "remove it" by calling the
> >>>>>>>>>>>>>>> COSString.reset() method.
> >>>>>>>>>>>>>>> The problem is that when I open the output PDF file, I still see the
> >>>>>>>>>>>>>>> text, but it is not selectable (I mean when I try to highlight it with
> >>>>>>>>>>>>>>> the mouse to copy it, it's not selectable!). When I print the content
> >>>>>>>>>>>>>>> (tokens) of the output file, I DO NOT find the text at all!!
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I am currently stuck in the PDF 1.5 specification and really running
> >>>>>>>>>>>>>>> out of time.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> I'd very much appreciate any help or any idea of what's going on.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Notes:
> >>>>>>>>>>>>>>> 1. I use PDFBox 1.7.1
> >>>>>>>>>>>>>>> 2. This problem does not occur with all PDFs; only some PDFs cause
> >>>>>>>>>>>>>>> this problem.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Thank you very much.
> >>>>>>>>>>>>>>> a7mad
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> ---------------------------------------------------------------------
> >>>>>>>>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> >>>>>>>>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org