pdfbox-users mailing list archives

From a7med shre3y <a7med.shr...@gmail.com>
Subject Re: Text removal
Date Tue, 24 Mar 2015 08:40:42 GMT
Hi,

In fact, PDFBox calls the operation of transforming "7R %H $SSURYHG" into
"To Be Approved" "encoding". Either way, whether it's encoding or decoding,
I thought it would be easier to transform "7R %H $SSURYHG" into "To Be
Approved" than the opposite (or at least I don't know how to do the
opposite). I spent quite a long time trying to find out how to get the
character codes for the glyphs in the currently used font, and found that
it's not an easy task. By the way, if you know how to do that, I'd very much
appreciate it, because I need it to replace text with other text, and for
that the new text must be encoded the same way as the original!
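
For the one sample in this thread, the two directions can be tried out by
hand without PDFBox. The sketch below is plain Java; the class name
CodeTableSketch, its method names, and the hand-built table are hypothetical
and only for illustration. The table is derived from the single observed
pair of strings, so it covers just the characters that occur in it. In a
real document the code-to-Unicode table would have to come from the font
itself (for instance its ToUnicode CMap), which this sketch does not attempt.
It also does the comparison on the encoded side, as suggested in the quoted
reply: encode the search text, then look for it in the string read from the
PDF.

```java
import java.util.HashMap;
import java.util.Map;

class CodeTableSketch {

    // Build a code -> Unicode table from one observed pair of strings.
    // Assumes the two strings line up character by character.
    static Map<Character, Character> tableFrom(String encoded, String decoded) {
        if (encoded.length() != decoded.length()) {
            throw new IllegalArgumentException("Strings differ in length");
        }
        Map<Character, Character> table = new HashMap<>();
        for (int i = 0; i < encoded.length(); i++) {
            table.put(encoded.charAt(i), decoded.charAt(i));
        }
        return table;
    }

    // Reverse the table so that it maps Unicode -> code.
    static Map<Character, Character> invert(Map<Character, Character> table) {
        Map<Character, Character> inverse = new HashMap<>();
        for (Map.Entry<Character, Character> e : table.entrySet()) {
            inverse.put(e.getValue(), e.getKey());
        }
        return inverse;
    }

    // Translate text through a table; fail if a character has no entry,
    // which is what happens when the font lacks the glyph for new text.
    static String map(String text, Map<Character, Character> table) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            Character mapped = table.get(c);
            if (mapped == null) {
                throw new IllegalArgumentException("No entry for character: " + c);
            }
            out.append(mapped);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String fromPdf = "7R %H $SSURYHG"; // string as read from the Tj operand
        Map<Character, Character> toUnicode = tableFrom(fromPdf, "To Be Approved");
        Map<Character, Character> toCode = invert(toUnicode);

        System.out.println(map(fromPdf, toUnicode));   // To Be Approved
        String needle = map("Approved", toCode);
        System.out.println(needle);                    // $SSURYHG
        System.out.println(fromPdf.contains(needle));  // true
    }
}
```

Replacement text containing a character that is missing from the table makes
map throw, which mirrors the practical limit: new text can only be encoded
if the font in the PDF actually carries a code for every character in it.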

Back to the text removal: I am able to find the text and also remove it by
calling reset. As I mentioned in my first email, when I print the output
content I don't find the text anymore, but I still see it when I open the
file. My first assumption was that there must be some other way to remove
the text than the one I am using, and that's what you've actually confirmed
in your reply, so could you please tell me what is still missing?

Thanks and regards,
a7mad

On Tue, Mar 24, 2015 at 9:22 AM, Maruan Sahyoun <sahyoun@fileaffairs.de>
wrote:

> Hi,
>
> > On 24.03.2015 at 08:14, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >
> > Hi,
> >
> > Here's how I do it:
> >
> > 1. I use the following method to encode the text:
> >
> > String encode(String text, PDFont font) throws Exception {
> >        StringBuilder builder = new StringBuilder();
> >        byte[] stringBytes = text.getBytes();
> >        int codeLength = 1;
> >        for(int i = 0; i < stringBytes.length; i += codeLength){
> >                String c = font.encode(stringBytes, i, codeLength);
> >                if(c == null && (i + 1 < stringBytes.length)){
> >                    codeLength++;
> >                    c = font.encode(stringBytes, i, codeLength);
> >                }
> >                builder.append(c);
> >            }
> >        return builder.toString();
> >    }
> >
> > 2. Iterating through the tokens, I find the text, whether it's a COSString
> > ("Tj" operator) or a COSArray ("TJ" operator), then check whether it's the
> > text I want to remove, as follows:
> >
> > if (op.getOperation().equals("Tj")) {
> >                            COSString previous = (COSString) tokens.get(j - 1);
> >                            String string = previous.getString();
> >                            String encodedString = encode(string, font);
>
> that string is already encoded. So you'd need to encode "To Be Approved"
> and compare if that matches the string you are reading from the PDF.
>
> >                            if(encodedString.contains("To Be Approved")){
> >                                previous.reset();
> >                            }
> >                        } else if (op.getOperation().equals("TJ")) {
> >                            COSArray previous = (COSArray) tokens.get(j - 1);
> >                            StringBuilder stringBuilder = new StringBuilder();
> >                            for (int k = 0; k < previous.size(); k++) {
> >                                Object arrElement = previous.getObject(k);
> >                                if (arrElement instanceof COSString) {
> >                                    COSString cosString = (COSString) arrElement;
> >                                    stringBuilder.append(cosString.getString());
> >                                }
> >                            }
> >                            String string = stringBuilder.toString();
> >                            String encodedString = encode(string, font);
> >                            if(encodedString.contains("To Be Approved")){
> >                                previous.clear();
> >                            }
> >                        }
> >
> > Note:
> > In case of COSArray, I first iterate through the whole array to get the
> > whole string before encoding and comparison and this works.
> >
> > Best Regards,
> > a7mad
> >
> >
> >
> > On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
> > wrote:
> >
> >> Hi,
> >>
> >> your text is encoded, so within the show-text operator Tj the string is
> >>
> >> 7R %H $SSURYHG
> >>
> >> You wrote that you encode your string to find it - what do you get?
> >>
> >> BR
> >> Maruan
> >>
> >>
> >>
> >>> On 23.03.2015 at 22:01, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >>>
> >>> Hi Maruan,
> >>>
> >>> Here's a link from where you can download the PDF.
> >>>
> >>>
> >>> https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
> >>>
> >>> Kind Regards,
> >>> a7mad
> >>>
> >>> On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> you need to upload it to a public location as the mailing list doesn't
> >>>> support attachments.
> >>>>
> >>>> BR
> >>>> Maruan
> >>>>
> >>>>> On 23.03.2015 at 19:18, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >>>>>
> >>>>> Dear Maruan,
> >>>>>
> >>>>> Thank you very much for the information. Please find attached the PDF
> >>>>> to reproduce the problem.
> >>>>> The text to remove is: "To Be Approved". The text has a multi-byte
> >>>>> encoding, so I first encode it in order to find it, then remove it.
> >>>>>
> >>>>> Best Regards,
> >>>>> a7mad
> >>>>>
> >>>>>> On Mon, Mar 23, 2015 at 4:13 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
> >>>>>> wrote:
> >>>>>> Dear a7mad,
> >>>>>>
> >>>>>> removing text from a PDF is not an easy task, as
> >>>>>> - text which visually appears as a single item might consist of
> >>>>>>   individual parts within the PDF itself, e.g. each character or
> >>>>>>   group of characters is placed individually in different COSStrings
> >>>>>> - text might be drawn using graphics commands
> >>>>>> - text can appear within different parts of the PDF (e.g. the text
> >>>>>>   might be the content of a form field AND of the annotation
> >>>>>>   representing the form field visually)
> >>>>>> - you need to look up the encoding information to get from the
> >>>>>>   characters in the PDF "string" to the ones you are looking for
> >>>>>> ….
> >>>>>>
> >>>>>> If you can post a specific PDF to a public location and describe in
> >>>>>> detail which string should have been replaced but wasn't, I will be
> >>>>>> able to tell you why that might have happened.
> >>>>>>
> >>>>>> Maruan
> >>>>>>
> >>>>>>
> >>>>>>> On 23.03.2015 at 15:03, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Hi all,
> >>>>>>>
> >>>>>>> Currently I am facing a strange problem removing text from some PDFs.
> >>>>>>> My program is able to find the text and "remove it" by calling the
> >>>>>>> COSString.reset() method.
> >>>>>>> The problem is, when I open the output PDF file, I still see the
> >>>>>>> text, but it is not selectable (I mean when I try to highlight it
> >>>>>>> with the mouse to copy it, it's not selectable!). When I print the
> >>>>>>> content (tokens) of the output file, I DO NOT find the text at all!!
> >>>>>>>
> >>>>>>> I am currently stuck in the PDF 1.5 specification and really running
> >>>>>>> out of time.
> >>>>>>>
> >>>>>>> I'd very much appreciate any help or any idea on what's going on.
> >>>>>>>
> >>>>>>> Notes:
> >>>>>>> 1. I use PDFBox 1.7.1
> >>>>>>> 2. This problem does not occur with all PDFs; only some PDFs cause
> >>>>>>> it.
> >>>>>>>
> >>>>>>> Thank you very much.
> >>>>>>> a7mad
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> >>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
