pdfbox-users mailing list archives

From a7med shre3y <a7med.shr...@gmail.com>
Subject Re: Text removal
Date Tue, 24 Mar 2015 07:14:06 GMT
Hi,

Here's how I do it:

1. I use the following method to encode the text:

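String encode(String text, PDFont font) throws Exception {
    StringBuilder builder = new StringBuilder();
    // Note: getBytes() uses the platform default charset.
    byte[] stringBytes = text.getBytes();
    int codeLength = 1;
    for (int i = 0; i < stringBytes.length; i += codeLength) {
        // Start each character with a one-byte code ...
        codeLength = 1;
        String c = font.encode(stringBytes, i, codeLength);
        // ... and retry with a two-byte code if the font does not know it.
        if (c == null && (i + 1 < stringBytes.length)) {
            codeLength++;
            c = font.encode(stringBytes, i, codeLength);
        }
        // Skip codes the font cannot map instead of appending "null".
        if (c != null) {
            builder.append(c);
        }
    }
    return builder.toString();
}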
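
The one-byte-then-two-byte lookup can be tried without PDFBox. The sketch below is only an illustration: the font is replaced by a plain map from hex-encoded code bytes to text, and the names CodeLengthSketch, lookup and decode are made up for this example, not PDFBox API.

```java
import java.util.HashMap;
import java.util.Map;

final class CodeLengthSketch {

    // Hypothetical stand-in for the font lookup: returns the text mapped to
    // the code formed by `length` bytes starting at `offset`, or null.
    static String lookup(Map<String, String> table, byte[] bytes, int offset, int length) {
        StringBuilder key = new StringBuilder();
        for (int i = offset; i < offset + length; i++) {
            key.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return table.get(key.toString());
    }

    // Decodes a byte string whose codes are one or two bytes long.
    static String decode(byte[] bytes, Map<String, String> table) {
        StringBuilder out = new StringBuilder();
        int codeLength;
        for (int i = 0; i < bytes.length; i += codeLength) {
            codeLength = 1; // always try the short code first
            String c = lookup(table, bytes, i, codeLength);
            if (c == null && i + 1 < bytes.length) {
                codeLength++; // fall back to a two-byte code
                c = lookup(table, bytes, i, codeLength);
            }
            if (c != null) {
                out.append(c); // unmapped codes are skipped
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> table = new HashMap<>();
        table.put("41", "A");   // one-byte code
        table.put("0037", "T"); // two-byte code
        System.out.println(decode(new byte[] {0x00, 0x37, 0x41}, table));
    }
}
```

Resetting the code length on every iteration is what lets one-byte and two-byte codes be mixed in the same string.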

2. While iterating through the tokens, I look at the operand in front of
each text-showing operator: a COSString for "Tj" or a COSArray for "TJ".
I then check whether it contains the text I want to remove, as follows:

if (op.getOperation().equals("Tj")) {
    COSString previous = (COSString) tokens.get(j - 1);
    String string = previous.getString();
    String encodedString = encode(string, font);
    if (encodedString.contains("To Be Approved")) {
        previous.reset();
    }
} else if (op.getOperation().equals("TJ")) {
    COSArray previous = (COSArray) tokens.get(j - 1);
    StringBuilder stringBuilder = new StringBuilder();
    for (int k = 0; k < previous.size(); k++) {
        Object arrElement = previous.getObject(k);
        if (arrElement instanceof COSString) {
            COSString cosString = (COSString) arrElement;
            stringBuilder.append(cosString.getString());
        }
    }
    String string = stringBuilder.toString();
    String encodedString = encode(string, font);
    if (encodedString.contains("To Be Approved")) {
        previous.clear();
    }
}
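
The TJ branch can be reduced to a small self-contained sketch. It assumes the array is modelled as a plain Java list in which strings are text pieces and numbers are position adjustments; TjArraySketch, joinStrings and containsText are illustrative names only, not PDFBox API.

```java
import java.util.Arrays;
import java.util.List;

final class TjArraySketch {

    // Concatenates only the string elements; numeric adjustments are ignored.
    static String joinStrings(List<?> elements) {
        StringBuilder sb = new StringBuilder();
        for (Object element : elements) {
            if (element instanceof String) {
                sb.append((String) element);
            }
        }
        return sb.toString();
    }

    // True if the target text occurs in the concatenated string elements.
    static boolean containsText(List<?> elements, String target) {
        return joinStrings(elements).contains(target);
    }

    public static void main(String[] args) {
        List<Object> tj = Arrays.<Object>asList("To ", -120, "Be App", 35.5, "roved");
        System.out.println(containsText(tj, "To Be Approved"));
    }
}
```

Matching on the concatenation is what allows a phrase to be found even when it is split across several array elements.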

Note:
For a COSArray, I first iterate over the whole array and concatenate its
strings, and only then encode and compare the result. This works.
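
To see why the raw operand cannot be compared with the plain text directly, take the raw string quoted further down in this thread, "7R %H $SSURYHG". The sketch below compares it with "To Be Approved" character by character. It is only an observation about this one string as it appears in the email, not a general rule: in general the mapping comes from the font's encoding information. OffsetSketch, commonOffset and NONE are illustrative names.

```java
final class OffsetSketch {

    // Returned when the compared characters do not share one offset.
    static final int NONE = Integer.MIN_VALUE;

    // Returns the difference shared by all non-space character pairs, or NONE.
    static int commonOffset(String plain, String raw) {
        if (plain.length() != raw.length()) {
            return NONE;
        }
        int offset = NONE;
        for (int i = 0; i < plain.length(); i++) {
            char p = plain.charAt(i);
            if (p == ' ') {
                continue; // separators skipped: their raw value is unclear from the email
            }
            int difference = p - raw.charAt(i);
            if (offset == NONE) {
                offset = difference;
            } else if (difference != offset) {
                return NONE;
            }
        }
        return offset;
    }

    public static void main(String[] args) {
        System.out.println(commonOffset("To Be Approved", "7R %H $SSURYHG"));
    }
}
```

For this string every visible character is shifted by the same amount, so a literal search for "To Be Approved" in the raw operand cannot succeed without decoding first.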

Best Regards,
a7mad



On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
wrote:

> Hi,
>
> your text is encoded, so within the show-text operator Tj the string is
>
> 7R %H $SSURYHG
>
> You wrote that you encode your string to find it - what do you get?
>
> BR
> Maruan
>
>
>
> > Am 23.03.2015 um 22:01 schrieb a7med shre3y <a7med.shre3y@gmail.com>:
> >
> > Hi Maruan,
> >
> > Here's a link from where you can download the PDF.
> >
> >
> > https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
> >
> > Kind Regards,
> > a7mad
> >
> > On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
> > wrote:
> >
> >> Hi,
> >>
> >> you need to upload it to a public location as the mailing list doesn't
> >> support attachments.
> >>
> >> BR
> >> Maruan
> >>
> >>> Am 23.03.2015 um 19:18 schrieb a7med shre3y <a7med.shre3y@gmail.com>:
> >>>
> >>> Dear Maruan,
> >>>
> >>> Thank you very much for the information. Please find attached the PDF
> >>> to reproduce the problem.
> >>> The text to remove is: "To Be Approved". The text has a multi-byte
> >>> encoding, so I first encode it in order to find it and then remove it.
> >>>
> >>> Best Regards,
> >>> a7mad
> >>>
> >>>> On Mon, Mar 23, 2015 at 4:13 PM, Maruan Sahyoun
> >>>> <sahyoun@fileaffairs.de> wrote:
> >>>> Dear a7mad,
> >>>>
> >>>> removing text from a PDF is not an easy task as
> >>>> - text which visually appears as a single item might consist of
> >>>>   individual parts within the PDF itself, e.g. each character or group
> >>>>   of characters is placed individually in a different COSString
> >>>> - text might be drawn using graphics commands
> >>>> - text can appear within different parts of the PDF (e.g. the text
> >>>>   might be the content of a form field AND of the annotation
> >>>>   representing the form field visually)
> >>>> - you need to look up the encoding information to get from the
> >>>>   characters in the PDF "string" to the ones you are looking for
> >>>> ….
> >>>>
> >>>> If you can post a specific PDF to a public location and describe in
> >>>> detail which string should have been replaced but wasn't, I will be
> >>>> able to tell you why that might have happened.
> >>>>
> >>>> Maruan
> >>>>
> >>>>
> >>>>> Am 23.03.2015 um 15:03 schrieb a7med shre3y <a7med.shre3y@gmail.com>:
> >>>>>
> >>>>> Hi all,
> >>>>>
> >>>>> Currently I am facing a strange problem removing text from some PDFs.
> >>>>> My program is able to find the text and "remove it" by calling the
> >>>>> COSString.reset() method.
> >>>>> The problem is that when I open the output PDF file, I still see the
> >>>>> text, but it is not selectable (when I try to highlight it with the
> >>>>> mouse to copy it, nothing is selected). When I print the content
> >>>>> (tokens) of the output file, I DO NOT find the text at all!!
> >>>>>
> >>>>> I am currently stuck in the PDF 1.5 specification and really running
> >>>>> out of time.
> >>>>>
> >>>>> I'd so much appreciate any help or any idea on what's going on.
> >>>>>
> >>>>> Notes:
> >>>>> 1. I use PDFBox 1.7.1
> >>>>> 2. This problem does not occur with all PDFs; only some PDFs cause
> >>>>> it.
> >>>>>
> >>>>> Thank you very much.
> >>>>> a7mad
> >>>>
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> >>>> For additional commands, e-mail: users-help@pdfbox.apache.org
> >>>
> >>>
> >>
>
>
>
>
