pdfbox-users mailing list archives

From Maruan Sahyoun <sahy...@fileaffairs.de>
Subject Re: Text removal
Date Tue, 24 Mar 2015 09:40:06 GMT

> On 24.03.2015 at 10:36, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> 
> What are the drawing commands? I'd then investigate how to identify the
> ones that draw the text.
> 

738.7469 167.1278 m
733.8743 167.1278 l
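
The two lines above are path construction operators: `m` starts a new subpath at a point and `l` appends a straight line to it. As a minimal sketch of how such commands could be spotted, the following plain-Java example scans a whitespace-separated content stream fragment for path operators. The class name, the method name and the trailing `S` (stroke) in the sample are assumptions for illustration; the naive tokenizer ignores strings, arrays and inline images, so it is not a replacement for a real content stream parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PathOperatorScan {
    // Path construction and path painting operators of the PDF content stream syntax.
    static final Set<String> PATH_OPS = new HashSet<>(Arrays.asList(
            "m", "l", "c", "v", "y", "h", "re",                      // construction
            "S", "s", "f", "F", "f*", "B", "B*", "b", "b*", "n"));   // painting

    // Returns the path operators found in a whitespace-separated fragment, in order.
    static List<String> pathOperators(String content) {
        List<String> found = new ArrayList<>();
        for (String token : content.trim().split("\\s+")) {
            if (PATH_OPS.contains(token)) {
                found.add(token);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // First two lines are from the message; the final "S" is a hypothetical stroke.
        String fragment = "738.7469 167.1278 m\n733.8743 167.1278 l\nS";
        System.out.println(pathOperators(fragment)); // prints [m, l, S]
    }
}
```

Finding the operators is the easy part; deciding which of them belong to a particular glyph is the hard part discussed below.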



> On Tue, Mar 24, 2015 at 10:26 AM, Maruan Sahyoun <sahyoun@fileaffairs.de>
> wrote:
> 
>> 
>>> On 24.03.2015 at 10:14, a7med shre3y <a7med.shre3y@gmail.com> wrote:
>>> 
>>> That's true. I had already tried changing the text rendering mode to the
>>> other values listed in table 5.3 of the PDF 1.5 specification before
>>> removing the text, and that didn't work either.
>>> So how do I remove the graphics content then?
>> 
>> the simple answer - remove the drawing commands.
>> 
>> The longer answer: as you obviously don't want to remove all drawing
>> commands, you'd need to find the ones that draw the text. Since you
>> want to remove the vectors that make up a particular character/glyph,
>> you first need to find out which ones draw, e.g., the letter 'T'. I
>> don't think this is doable in a reasonable amount of time for
>> arbitrary text.
>> 
>> Maruan
>> 
>> 
>>> 
>>> Best Regards,
>>> 
>>> On Tue, Mar 24, 2015 at 10:06 AM, Maruan Sahyoun <sahyoun@fileaffairs.de>
>>> wrote:
>>> 
>>>> Hi,
>>>> 
>>>>> On 24.03.2015 at 09:55, a7med shre3y <a7med.shre3y@gmail.com> wrote:
>>>>> 
>>>>> You can download it from here:
>>>>> 
>>>>> https://drive.google.com/file/d/0B5Kxacm1mej-MEZubTNYVVJYTFE/view?usp=sharing
>>>>> 
>>>> 
>>>> Looking more closely, you correctly replaced the text, but that text was
>>>> only there to make the PDF searchable: it used text rendering mode 3
>>>> (invisible). The 'text' you are still seeing is drawn using vector
>>>> commands, so it's graphics content.
>>>> 
>>>> BR
>>>> Maruan
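
Text rendering mode 3 is set with the `Tr` operator, so a hidden search layer of this kind can be recognized by tracking the most recent `Tr` operand while walking the tokens. The sketch below does that on an already tokenized list of plain strings. The class name, the method name, the font name `/F1` and the size `12` are assumptions for illustration, and `q`/`Q` graphics state save and restore is deliberately ignored.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class InvisibleTextScan {
    // Returns the operands of Tj operators that are shown while
    // text rendering mode 3 (invisible) is active.
    static List<String> invisibleText(List<String> tokens) {
        List<String> hidden = new ArrayList<>();
        int renderingMode = 0; // default text rendering mode: fill
        for (int i = 1; i < tokens.size(); i++) {
            String op = tokens.get(i);
            String operand = tokens.get(i - 1);
            if (op.equals("Tr")) {
                renderingMode = Integer.parseInt(operand);
            } else if (op.equals("Tj") && renderingMode == 3) {
                hidden.add(operand);
            }
        }
        return hidden;
    }

    public static void main(String[] args) {
        // Hypothetical token sequence; only the string itself comes from the thread.
        List<String> tokens = Arrays.asList(
                "BT", "/F1", "12", "Tf", "3", "Tr", "(7R %H $SSURYHG)", "Tj", "ET");
        System.out.println(invisibleText(tokens)); // prints [(7R %H $SSURYHG)]
    }
}
```

If a string turns up in this list, removing it only affects searching and copying; whatever is visible on the page is drawn by other operators.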
>>>> 
>>>> 
>>>>> Best Regards,
>>>>> 
>>>>> 
>>>>> On Tue, Mar 24, 2015 at 9:48 AM, Maruan Sahyoun <sahyoun@fileaffairs.de>
>>>>> wrote:
>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 24.03.2015 at 09:40, a7med shre3y <a7med.shre3y@gmail.com> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> In fact, PDFBox calls the operation of transforming "7R %H $SSURYHG" to
>>>>>>> "To Be Approved" "encoding". Anyway, whether it's encoding or decoding, I
>>>>>>> thought it was easier to transform "7R %H $SSURYHG" to "To Be Approved"
>>>>>>> than the opposite (or at least I don't know how to do the opposite). I
>>>>>>> spent quite a long time trying to find the character codes for the glyphs
>>>>>>> in the currently used font, and found that it's not an easy task. By the
>>>>>>> way, if you know how to do that, I'd very much appreciate it, because I
>>>>>>> need it for replacing text with other text, and for that the new text
>>>>>>> must be encoded the same way as the original!
>>>>>>> 
>>>>>>> Back to the text removal: I am able to find the text and also remove it
>>>>>>> by calling reset, as I mentioned in my first email. When I print the
>>>>>>> output content I don't find the text anymore, but I still see it when I
>>>>>>> open the file. My first assumption was that there must be some other way
>>>>>>> to remove the text than the one I am using, and that's what you've
>>>>>>> actually confirmed in your reply. So could you please tell me what is
>>>>>>> still missing?
>>>>>>> 
>>>>>> 
>>>>>> Could you upload the PDF with the reset text too?
>>>>>> 
>>>>>> BR
>>>>>> Maruan
>>>>>> 
>>>>>> 
>>>>>>> Thanks and regards,
>>>>>>> a7mad
>>>>>>> 
>>>>>>> On Tue, Mar 24, 2015 at 9:22 AM, Maruan Sahyoun <sahyoun@fileaffairs.de>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>>> On 24.03.2015 at 08:14, a7med shre3y <a7med.shre3y@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> Here's how I do it:
>>>>>>>>> 
>>>>>>>>> 1. I use the following method to encode the text:
>>>>>>>>> 
>>>>>>>>> String encode(String text, PDFont font) throws Exception {
>>>>>>>>>     StringBuilder builder = new StringBuilder();
>>>>>>>>>     byte[] stringBytes = text.getBytes();
>>>>>>>>>     int codeLength;
>>>>>>>>>     for (int i = 0; i < stringBytes.length; i += codeLength) {
>>>>>>>>>         // Try a one-byte code first; start over for every character.
>>>>>>>>>         codeLength = 1;
>>>>>>>>>         String c = font.encode(stringBytes, i, codeLength);
>>>>>>>>>         if (c == null && (i + 1 < stringBytes.length)) {
>>>>>>>>>             // No mapping for one byte, so retry with a two-byte code.
>>>>>>>>>             codeLength++;
>>>>>>>>>             c = font.encode(stringBytes, i, codeLength);
>>>>>>>>>         }
>>>>>>>>>         if (c != null) {
>>>>>>>>>             builder.append(c); // skip codes the font cannot map
>>>>>>>>>         }
>>>>>>>>>     }
>>>>>>>>>     return builder.toString();
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> 2. Iterating through the tokens, I find the text, which is either a
>>>>>>>>> COSString ("Tj" operator) or a COSArray ("TJ" operator), then check
>>>>>>>>> whether it's the text I want to remove, as follows:
>>>>>>>>> 
>>>>>>>>> if (op.getOperation().equals("Tj")) {
>>>>>>>>>     COSString previous = (COSString) tokens.get(j - 1);
>>>>>>>>>     String string = previous.getString();
>>>>>>>>>     String encodedString = encode(string, font);
>>>>>>>> 
>>>>>>>> That string is already encoded. So you'd need to encode "To Be Approved"
>>>>>>>> and check whether the result matches the string you are reading from the
>>>>>>>> PDF.
>>>>>>>> 
>>>>>>>>>     if (encodedString.contains("To Be Approved")) {
>>>>>>>>>         previous.reset();
>>>>>>>>>     }
>>>>>>>>> } else if (op.getOperation().equals("TJ")) {
>>>>>>>>>     COSArray previous = (COSArray) tokens.get(j - 1);
>>>>>>>>>     StringBuilder stringBuilder = new StringBuilder();
>>>>>>>>>     for (int k = 0; k < previous.size(); k++) {
>>>>>>>>>         Object arrElement = previous.getObject(k);
>>>>>>>>>         if (arrElement instanceof COSString) {
>>>>>>>>>             COSString cosString = (COSString) arrElement;
>>>>>>>>>             stringBuilder.append(cosString.getString());
>>>>>>>>>         }
>>>>>>>>>     }
>>>>>>>>>     String string = stringBuilder.toString();
>>>>>>>>>     String encodedString = encode(string, font);
>>>>>>>>>     if (encodedString.contains("To Be Approved")) {
>>>>>>>>>         previous.clear();
>>>>>>>>>     }
>>>>>>>>> }
>>>>>>>>> 
>>>>>>>>> Note:
>>>>>>>>> In the case of a COSArray, I first iterate through the whole array to
>>>>>>>>> collect the complete string before encoding and comparing, and this works.
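
The note above relies on the fact that a `TJ` operand array mixes strings with numeric position adjustments, and only the strings carry text. A minimal sketch of that concatenation follows, using a plain `List<Object>` as a stand-in for the PDFBox array type. The class name, the method name and the adjustment values in the sample are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class TjArrayText {
    // Joins the string elements of a TJ-style array and skips the numeric
    // elements, which only adjust the position of the following glyphs.
    static String joinStrings(List<Object> tjArray) {
        StringBuilder text = new StringBuilder();
        for (Object element : tjArray) {
            if (element instanceof String) {
                text.append((String) element);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Hypothetical split of the encoded string from the thread.
        List<Object> array = Arrays.<Object>asList("7R ", -120, "%H ", 15.5, "$SSURYHG");
        System.out.println(joinStrings(array)); // prints 7R %H $SSURYHG
    }
}
```

Comparing against the joined string avoids missing a match that is split across several array elements.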
>>>>>>>>> 
>>>>>>>>> Best Regards,
>>>>>>>>> a7mad
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
>>>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> your text is encoded, so within the show text operator Tj the string is
>>>>>>>>>> 
>>>>>>>>>> 7R %H $SSURYHG
>>>>>>>>>> 
>>>>>>>>>> You wrote that you encode your string to find it - what do you get?
>>>>>>>>>> 
>>>>>>>>>> BR
>>>>>>>>>> Maruan
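
Since the string stored in the PDF is "7R %H $SSURYHG" rather than "To Be Approved", one way to search for it is to encode the search text with a Unicode-to-code map and compare it with the raw string. The sketch below builds such a map from the two sample strings as they appear in this thread. That is an assumption made only for illustration: in a real document the mapping would have to come from the font's encoding information, and the class and method names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class SearchTextEncoder {
    // Builds a Unicode -> code map by pairing an encoded sample with its known plain text.
    static Map<Character, Character> reverseMap(String encoded, String plain) {
        if (encoded.length() != plain.length()) {
            throw new IllegalArgumentException("samples must have the same length");
        }
        Map<Character, Character> unicodeToCode = new HashMap<>();
        for (int i = 0; i < plain.length(); i++) {
            unicodeToCode.put(plain.charAt(i), encoded.charAt(i));
        }
        return unicodeToCode;
    }

    // Encodes the text; returns null if any character has no known code.
    static String encode(String text, Map<Character, Character> unicodeToCode) {
        StringBuilder encoded = new StringBuilder();
        for (char ch : text.toCharArray()) {
            Character code = unicodeToCode.get(ch);
            if (code == null) {
                return null;
            }
            encoded.append(code);
        }
        return encoded.toString();
    }

    public static void main(String[] args) {
        String raw = "7R %H $SSURYHG"; // string as read from the Tj operand
        Map<Character, Character> map = reverseMap(raw, "To Be Approved");
        String needle = encode("Approved", map);
        System.out.println(needle + " " + raw.contains(needle)); // prints $SSURYHG true
    }
}
```

The same map also covers the replacement case: new text has to be encoded this way before it is written back into the content stream.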
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On 23.03.2015 at 22:01, a7med shre3y <a7med.shre3y@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi Maruan,
>>>>>>>>>>> 
>>>>>>>>>>> Here's a link from where you can download the PDF.
>>>>>>>>>>> 
>>>>>>>>>>> https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
>>>>>>>>>>> 
>>>>>>>>>>> Kind Regards,
>>>>>>>>>>> a7mad
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> 
>>>>>>>>>>>> you need to upload it to a public location as the mailing list doesn't
>>>>>>>>>>>> support attachments.
>>>>>>>>>>>> 
>>>>>>>>>>>> BR
>>>>>>>>>>>> Maruan
>>>>>>>>>>>> 
>>>>>>>>>>>>> On 23.03.2015 at 19:18, a7med shre3y <a7med.shre3y@gmail.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Dear Maruan,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thank you very much for the information. Please find attached the PDF
>>>>>>>>>>>>> to reproduce the problem.
>>>>>>>>>>>>> The text to remove is: "To Be Approved". The text has a multi-byte
>>>>>>>>>>>>> encoding, so I first encode it in order to find it and then remove it.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Best Regards,
>>>>>>>>>>>>> a7mad
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Mon, Mar 23, 2015 at 4:13 PM, Maruan Sahyoun <sahyoun@fileaffairs.de>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> Dear a7mad,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> removing text from a PDF is not an easy task, as
>>>>>>>>>>>>>> - text which visually appears as a single item might consist of
>>>>>>>>>>>>>>   individual parts within the PDF itself, e.g. each character or group
>>>>>>>>>>>>>>   of characters is placed individually in a different COSString
>>>>>>>>>>>>>> - text might be drawn using graphics commands
>>>>>>>>>>>>>> - text can appear within different parts of the PDF (e.g. the text
>>>>>>>>>>>>>>   might be the content of a form field AND of the annotation
>>>>>>>>>>>>>>   representing the form field visually)
>>>>>>>>>>>>>> - you need to look up the encoding information to get from the
>>>>>>>>>>>>>>   characters in the PDF "string" to the ones you are looking for
>>>>>>>>>>>>>> ….
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> If you can post a specific PDF to a public location and describe in
>>>>>>>>>>>>>> detail which string should have been replaced but wasn't, I will be
>>>>>>>>>>>>>> able to tell you why that might have happened.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Maruan
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On 23.03.2015 at 15:03, a7med shre3y <a7med.shre3y@gmail.com> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Currently I am facing a strange problem removing text from some PDFs.
>>>>>>>>>>>>>>> My program is able to find the text and "remove it" by calling the
>>>>>>>>>>>>>>> COSString.reset() method.
>>>>>>>>>>>>>>> The problem is that when I open the output PDF file, I still see the
>>>>>>>>>>>>>>> text, but it is not selectable (when I try to highlight it with the
>>>>>>>>>>>>>>> mouse to copy it, nothing is selected). When I print the content
>>>>>>>>>>>>>>> (tokens) of the output file, I DO NOT find the text at all!!
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I am currently stuck in the PDF 1.5 specification and am really
>>>>>>>>>>>>>>> running out of time.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I'd very much appreciate any help or ideas on what's going on.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Notes:
>>>>>>>>>>>>>>> 1. I use PDFBox 1.7.1
>>>>>>>>>>>>>>> 2. This problem does not occur with all PDFs; only some PDFs cause
>>>>>>>>>>>>>>> it.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Thank you very much.
>>>>>>>>>>>>>>> a7mad
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>>>>>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org