pdfbox-users mailing list archives

From Maruan Sahyoun <sahy...@fileaffairs.de>
Subject Re: Text removal
Date Tue, 24 Mar 2015 08:22:25 GMT
Hi,

> On 24.03.2015 at 08:14, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> 
> Hi,
> 
> Here's how I do it:
> 
> 1. I use the following method to encode the text:
> 
> String encode(String text, PDFont font) throws Exception {
>     StringBuilder builder = new StringBuilder();
>     byte[] stringBytes = text.getBytes();
>     int codeLength = 1;
>     for (int i = 0; i < stringBytes.length; i += codeLength) {
>         String c = font.encode(stringBytes, i, codeLength);
>         if (c == null && (i + 1 < stringBytes.length)) {
>             codeLength++;
>             c = font.encode(stringBytes, i, codeLength);
>         }
>         builder.append(c);
>     }
>     return builder.toString();
> }
> 
> 2. Iterating through the tokens, I find the text, which is either a COSString
> ("Tj" operator) or a COSArray ("TJ" operator), then check whether it's the
> text I want to remove, as follows:
> 
> if (op.getOperation().equals("Tj")) {
>     COSString previous = (COSString) tokens.get(j - 1);
>     String string = previous.getString();
>     String encodedString = encode(string, font);

The string you read from the PDF is already encoded. So you'd need to encode
"To Be Approved" and check whether that matches the string you are reading
from the PDF.

>     if (encodedString.contains("To Be Approved")) {
>         previous.reset();
>     }
> } else if (op.getOperation().equals("TJ")) {
>     COSArray previous = (COSArray) tokens.get(j - 1);
>     StringBuilder stringBuilder = new StringBuilder();
>     for (int k = 0; k < previous.size(); k++) {
>         Object arrElement = previous.getObject(k);
>         if (arrElement instanceof COSString) {
>             COSString cosString = (COSString) arrElement;
>             stringBuilder.append(cosString.getString());
>         }
>     }
>     String string = stringBuilder.toString();
>     String encodedString = encode(string, font);
>     if (encodedString.contains("To Be Approved")) {
>         previous.clear();
>     }
> }
> 
> Note:
> In the COSArray case, I first iterate over the whole array to build the
> complete string before encoding and comparing, and this works.
> 
> Best Regards,
> a7mad
> 
> 
> 
> On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
> wrote:
> 
>> Hi,
>> 
>> your text is encoded so within the show text operator Tj the string is
>> 
>> 7R %H $SSURYHG
>> 
>> You wrote that you encode your string to find it - what do you get?
>> 
>> BR
>> Maruan
>> 
>> 
>> 
>>> On 23.03.2015 at 22:01, a7med shre3y <a7med.shre3y@gmail.com> wrote:
>>> 
>>> Hi Maruan,
>>> 
>>> Here's a link from where you can download the PDF.
>>> 
>>> 
>>> https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
>>> 
>>> Kind Regards,
>>> a7mad
>>> 
>>> On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
>>> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> you need to upload it to a public location as the mailing list doesn't
>>>> support attachments.
>>>> 
>>>> BR
>>>> Maruan
>>>> 
>>>>> On 23.03.2015 at 19:18, a7med shre3y <a7med.shre3y@gmail.com> wrote:
>>>>> 
>>>>> Dear Maruan,
>>>>> 
>>>>> Thank you very much for the information. Please find herewith attached
>>>>> the PDF to reproduce the problem.
>>>>> The text to remove is: "To Be Approved". The text has a multi-byte
>>>>> encoding, so I first encode it in order to find it, then remove it.
>>>>> 
>>>>> Best Regards,
>>>>> a7mad
>>>>> 
>>>>>> On Mon, Mar 23, 2015 at 4:13 PM, Maruan Sahyoun <sahyoun@fileaffairs.de> wrote:
>>>>>> Dear a7mad,
>>>>>> 
>>>>>> removing text from a PDF is not an easy task as
>>>>>> - text which visually appears as a single item might consist of
>>>>>> individual parts within the PDF itself, e.g. each character or group of
>>>>>> characters is placed individually in a different COSString
>>>>>> - text might be drawn using graphics commands
>>>>>> - text can appear within different parts of the PDF (e.g. the text
>>>>>> might be the content of a form field AND of the annotation representing
>>>>>> the form field visually)
>>>>>> - you need to look up the encoding information to get from the
>>>>>> characters in the PDF "string" to the ones you are looking for
>>>>>> …
>>>>>> 
>>>>>> If you can post a specific PDF to a public location and describe in
>>>>>> detail which string should have been replaced but wasn't, I will be able
>>>>>> to tell you why that might have happened.
>>>>>> 
>>>>>> Maruan
>>>>>> 
>>>>>> 
>>>>>>> On 23.03.2015 at 15:03, a7med shre3y <a7med.shre3y@gmail.com> wrote:
>>>>>>> 
>>>>>>> Hi all,
>>>>>>> 
>>>>>>> Currently I am facing a strange problem removing text from some PDFs.
>>>>>>> My program is able to find the text and "remove it" by calling the
>>>>>>> COSString.reset() method.
>>>>>>> The problem is, when I open the output PDF file, I still see the text,
>>>>>>> but it is not selectable (I mean when I try to highlight it with the
>>>>>>> mouse to copy it, it's not selectable!). When I print the content
>>>>>>> (tokens) of the output file, I DO NOT find the text at all!!
>>>>>>> 
>>>>>>> I am currently stuck in the PDF specification 1.5 and really running
>>>>>>> out of time.
>>>>>>> 
>>>>>>> I'd so much appreciate any help or any idea on what's going on.
>>>>>>> 
>>>>>>> Notes:
>>>>>>> 1. I use PDFBox 1.7.1
>>>>>>> 2. This problem does not occur with all PDFs, only some PDFs cause this
>>>>>>> problem.
>>>>>>> 
>>>>>>> Thank you very much.
>>>>>>> a7mad
>>>>>> 
>>>>>> 
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>>> 
>>>>> 
>>>> 
>> 
>> 
>> 
>> 



