pdfbox-users mailing list archives

From Maruan Sahyoun <sahy...@fileaffairs.de>
Subject Re: Text removal
Date Tue, 24 Mar 2015 09:26:32 GMT

> On 24 Mar 2015, at 10:14, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> 
> That's true. I had already tried changing the text rendering mode to the
> other values listed in the PDF 1.5 specification, Table 5.3, before removing
> the text, and that didn't work either.
> So how do I remove the graphics content, then?

The simple answer: remove the drawing commands.

The longer answer: since you obviously don't want to remove all drawing commands, you'd need
to find the ones that draw the text. To remove the vectors that make up a certain
character/glyph, you first need to work out which commands draw, e.g., the letter 'T'.
I don't think that is doable in a reasonable amount of time for arbitrary text.
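
To illustrate why emptying the Tj/TJ strings leaves the visible "text" in place, here is a rough sketch. It is not PDFBox code: it assumes the content stream has already been split into plain string tokens, and the class and method names (OperatorScan, countTextShow, ...) are made up for this example. Only the operator names come from the PDF specification.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: classifies already-tokenized content stream operators.
final class OperatorScan {
    // Text-showing operators defined by the PDF specification.
    static final Set<String> TEXT_SHOW =
            new HashSet<>(Arrays.asList("Tj", "TJ", "'", "\""));
    // Path-painting operators; text converted to outlines is painted by these.
    static final Set<String> PATH_PAINT =
            new HashSet<>(Arrays.asList("S", "s", "f", "F", "f*", "B", "B*", "b", "b*"));

    /** Counts the text-showing operators in the token list. */
    static int countTextShow(List<String> tokens) {
        int n = 0;
        for (String t : tokens) {
            if (TEXT_SHOW.contains(t)) n++;
        }
        return n;
    }

    /** Counts the path-painting operators in the token list. */
    static int countPathPaint(List<String> tokens) {
        int n = 0;
        for (String t : tokens) {
            if (PATH_PAINT.contains(t)) n++;
        }
        return n;
    }

    /** True if some token pair reads "3 Tr", i.e. invisible text rendering mode. */
    static boolean usesInvisibleText(List<String> tokens) {
        for (int i = 1; i < tokens.size(); i++) {
            if (tokens.get(i).equals("Tr") && tokens.get(i - 1).equals("3")) {
                return true;
            }
        }
        return false;
    }
}
```

If a page reports invisible text plus a large number of path-painting operators, resetting the strings only removes the searchable layer. The painted outlines stay, and nothing in the stream says which of them form which letter.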

Maruan
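
On the encoding question quoted further down: the string read from the content stream is already in font codes, so the comparison has to happen in the encoded domain. A minimal sketch of that idea follows, under loud assumptions: the lookup table is built from the one sample pair shown in this thread ("To Be Approved" and "7R %H $SSURYHG"), one code per character, and EncodedSearch with its methods is a hypothetical helper, not a PDFBox API. In a real document the table would have to come from the font's encoding information, and codes may be multi-byte.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: searches raw (encoded) strings by encoding the needle first.
final class EncodedSearch {
    /** Builds a character -> code table from a known (decoded, encoded) pair. */
    static Map<Character, Character> tableFrom(String decoded, String encoded) {
        if (decoded.length() != encoded.length()) {
            throw new IllegalArgumentException("pair must have equal length");
        }
        Map<Character, Character> table = new HashMap<>();
        for (int i = 0; i < decoded.length(); i++) {
            table.put(decoded.charAt(i), encoded.charAt(i));
        }
        return table;
    }

    /** Maps the search text to codes; returns null if a character has no code. */
    static String encode(String text, Map<Character, Character> table) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            Character code = table.get(c);
            if (code == null) {
                return null; // character not available in this (subset) font
            }
            sb.append(code);
        }
        return sb.toString();
    }

    /** Encodes the needle, then searches the raw string as read from the PDF. */
    static boolean rawContains(String raw, String needle, Map<Character, Character> table) {
        String encoded = encode(needle, table);
        return encoded != null && raw.contains(encoded);
    }
}
```

The same table is what replacement text would need: if a character of the new text has no entry, the font (often a subset) cannot show it, and encode returns null instead of guessing.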


> 
> Best Regards,
> 
> On Tue, Mar 24, 2015 at 10:06 AM, Maruan Sahyoun <sahyoun@fileaffairs.de>
> wrote:
> 
>> Hi,
>> 
>>> On 24 Mar 2015, at 09:55, a7med shre3y <a7med.shre3y@gmail.com> wrote:
>>> 
>>> You can download it from here:
>>> 
>>> https://drive.google.com/file/d/0B5Kxacm1mej-MEZubTNYVVJYTFE/view?usp=sharing
>>> 
>> 
>> looking more closely, you correctly replaced the text, but that text was only
>> in there to make the PDF searchable, as it used text rendering mode 3
>> (invisible). The 'text' you are still seeing is drawn using vector commands,
>> so it's graphics content.
>> 
>> BR
>> Maruan
>> 
>> 
>>> Best Regards,
>>> 
>>> 
>>> On Tue, Mar 24, 2015 at 9:48 AM, Maruan Sahyoun <sahyoun@fileaffairs.de>
>>> wrote:
>>> 
>>>> 
>>>> 
>>>>> On 24 Mar 2015, at 09:40, a7med shre3y <a7med.shre3y@gmail.com> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> In fact, PDFBox calls the operation of transforming "7R %H $SSURYHG" to
>>>>> "To Be Approved" "encoding". Anyway, whether it's encoding or decoding, I
>>>>> thought it was easier to transform "7R %H $SSURYHG" to "To Be Approved"
>>>>> than the opposite (or at least I don't know how to do the opposite). I
>>>>> spent quite a long time trying to find the character codes for the glyphs
>>>>> in the currently used font, and found that it's not an easy task. By the
>>>>> way, if you know how to do that, I'd very much appreciate it, because I
>>>>> need it for replacing text with other text, and for that the new text must
>>>>> be encoded the same way as the original!
>>>>> 
>>>>> Back to the text removal: I am able to find the text and also remove it by
>>>>> calling reset. As I mentioned in my first email, when I print the output
>>>>> content I don't find the text anymore, but I still see it when I open the
>>>>> file. My first assumption was that there must be some way to remove the
>>>>> text other than the one I am using, and that's what you actually confirmed
>>>>> in your reply, so could you please tell me what is still missing?
>>>>> 
>>>> 
>>>> Could you upload the PDF with the reset text too?
>>>> 
>>>> BR
>>>> Maruan
>>>> 
>>>> 
>>>>> Thanks and regards,
>>>>> a7mad
>>>>> 
>>>>> On Tue, Mar 24, 2015 at 9:22 AM, Maruan Sahyoun <sahyoun@fileaffairs.de>
>>>>> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>>> On 24 Mar 2015, at 08:14, a7med shre3y <a7med.shre3y@gmail.com> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> Here's how I do it:
>>>>>>> 
>>>>>>> 1. I use the following method to encode the text:
>>>>>>> 
>>>>>>> String encode(String text, PDFont font) throws Exception {
>>>>>>>     StringBuilder builder = new StringBuilder();
>>>>>>>     byte[] stringBytes = text.getBytes();
>>>>>>>     int codeLength = 1;
>>>>>>>     for (int i = 0; i < stringBytes.length; i += codeLength) {
>>>>>>>         String c = font.encode(stringBytes, i, codeLength);
>>>>>>>         if (c == null && (i + 1 < stringBytes.length)) {
>>>>>>>             codeLength++;
>>>>>>>             c = font.encode(stringBytes, i, codeLength);
>>>>>>>         }
>>>>>>>         builder.append(c);
>>>>>>>     }
>>>>>>>     return builder.toString();
>>>>>>> }
>>>>>>> 
>>>>>>> 2. Iterating through the tokens, I find the text, whether it's a COSString
>>>>>>> ("Tj" operator) or a COSArray ("TJ" operator), then check whether it's the
>>>>>>> text I want to remove, as follows:
>>>>>>> 
>>>>>>> if (op.getOperation().equals("Tj")) {
>>>>>>>     COSString previous = (COSString) tokens.get(j - 1);
>>>>>>>     String string = previous.getString();
>>>>>>>     String encodedString = encode(string, font);
>>>>>> 
>>>>>> that string is already encoded. So you'd need to encode "To Be Approved"
>>>>>> and check whether it matches the string you are reading from the PDF.
>>>>>> 
>>>>>>>     if (encodedString.contains("To Be Approved")) {
>>>>>>>         previous.reset();
>>>>>>>     }
>>>>>>> } else if (op.getOperation().equals("TJ")) {
>>>>>>>     COSArray previous = (COSArray) tokens.get(j - 1);
>>>>>>>     StringBuilder stringBuilder = new StringBuilder();
>>>>>>>     for (int k = 0; k < previous.size(); k++) {
>>>>>>>         Object arrElement = previous.getObject(k);
>>>>>>>         if (arrElement instanceof COSString) {
>>>>>>>             COSString cosString = (COSString) arrElement;
>>>>>>>             stringBuilder.append(cosString.getString());
>>>>>>>         }
>>>>>>>     }
>>>>>>>     String string = stringBuilder.toString();
>>>>>>>     String encodedString = encode(string, font);
>>>>>>>     if (encodedString.contains("To Be Approved")) {
>>>>>>>         previous.clear();
>>>>>>>     }
>>>>>>> }
>>>>>>> 
>>>>>>> Note:
>>>>>>> In the COSArray case, I first iterate through the whole array to get the
>>>>>>> whole string before encoding and comparing, and this works.
>>>>>>> 
>>>>>>> Best Regards,
>>>>>>> a7mad
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> your text is encoded, so within the show text operator Tj the string is
>>>>>>>> 
>>>>>>>> 7R %H $SSURYHG
>>>>>>>> 
>>>>>>>> You wrote that you encode your string to find it - what do you get?
>>>>>>>> 
>>>>>>>> BR
>>>>>>>> Maruan
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On 23 Mar 2015, at 22:01, a7med shre3y <a7med.shre3y@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Maruan,
>>>>>>>>> 
>>>>>>>>> Here's a link from where you can download the PDF.
>>>>>>>>> 
>>>>>>>>> https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
>>>>>>>>> 
>>>>>>>>> Kind Regards,
>>>>>>>>> a7mad
>>>>>>>>> 
>>>>>>>>> On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> you need to upload it to a public location as the mailing list doesn't
>>>>>>>>>> support attachments.
>>>>>>>>>> 
>>>>>>>>>> BR
>>>>>>>>>> Maruan
>>>>>>>>>> 
>>>>>>>>>>> On 23 Mar 2015, at 19:18, a7med shre3y <a7med.shre3y@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Dear Maruan,
>>>>>>>>>>> 
>>>>>>>>>>> Thank you very much for the information. Please find attached the PDF
>>>>>>>>>>> to reproduce the problem.
>>>>>>>>>>> The text to remove is: "To Be Approved". The text has a multi-byte
>>>>>>>>>>> encoding, so I first encode it in order to find it, then remove it.
>>>>>>>>>>> 
>>>>>>>>>>> Best Regards,
>>>>>>>>>>> a7mad
>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, Mar 23, 2015 at 4:13 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>> Dear a7mad,
>>>>>>>>>>>> 
>>>>>>>>>>>> removing text from a PDF is not an easy task, as
>>>>>>>>>>>> - text which visually appears as a single item might consist of
>>>>>>>>>>>>   individual parts within the PDF itself, e.g. each character or group
>>>>>>>>>>>>   of characters is placed individually in a different COSString
>>>>>>>>>>>> - text might be drawn using graphics commands
>>>>>>>>>>>> - text can appear in different parts of the PDF (e.g. the text might be
>>>>>>>>>>>>   the content of a form field AND of the annotation representing the
>>>>>>>>>>>>   form field visually)
>>>>>>>>>>>> - you need to look up the encoding information to get from the
>>>>>>>>>>>>   characters in the PDF "string" to the ones you are looking for
>>>>>>>>>>>> ….
>>>>>>>>>>>> 
>>>>>>>>>>>> If you can post a specific PDF to a public location and describe in
>>>>>>>>>>>> detail which string should have been replaced but wasn't, I will be
>>>>>>>>>>>> able to tell you why that might have happened.
>>>>>>>>>>>> 
>>>>>>>>>>>> Maruan
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On 23 Mar 2015, at 15:03, a7med shre3y <a7med.shre3y@gmail.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Currently I am facing a strange problem removing text from some PDFs.
>>>>>>>>>>>>> My program is able to find the text and "remove it" by calling the
>>>>>>>>>>>>> COSString.reset() method.
>>>>>>>>>>>>> The problem is that when I open the output PDF file, I still see the
>>>>>>>>>>>>> text, but it is not selectable (I mean, when I try to highlight it with
>>>>>>>>>>>>> the mouse to copy it, it's not selectable!). When I print the content
>>>>>>>>>>>>> (tokens) of the output file, I DO NOT find the text at all!!
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I am currently stuck in the PDF 1.5 specification and really running
>>>>>>>>>>>>> out of time.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I'd very much appreciate any help or any idea on what's going on.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Notes:
>>>>>>>>>>>>> 1. I use PDFBox 1.7.1
>>>>>>>>>>>>> 2. This problem does not occur with all PDFs; only some PDFs cause it.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thank you very much.
>>>>>>>>>>>>> a7mad
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 