pdfbox-users mailing list archives

From a7med shre3y <a7med.shr...@gmail.com>
Subject Re: Text removal
Date Tue, 24 Mar 2015 09:36:42 GMT
What are the drawing commands? I'd then investigate how to identify the
ones that draw the text.

On Tue, Mar 24, 2015 at 10:26 AM, Maruan Sahyoun <sahyoun@fileaffairs.de>
wrote:

>
> > On 24.03.2015 at 10:14, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >
> > That's true. Before removing the text I had already tried changing the
> > text rendering mode to the other values listed in Table 5.3 of the PDF 1.5
> > specification, and that didn't work either.
> > So how do I remove the graphics content then?
>
> the simple answer - remove the drawing commands.
>
> The longer answer: as you obviously don't want to remove all drawing
> commands, you'd need to find the ones that draw the text. Since you
> would like to remove the vectors that match a certain
> character/glyph, you first need to find out which ones draw e.g.
> the letter 'T'. I don't think that this is doable in a reasonable amount of
> time for arbitrary text.
>
> Maruan
>
>
> >
> > Best Regards,
> >
> > On Tue, Mar 24, 2015 at 10:06 AM, Maruan Sahyoun <sahyoun@fileaffairs.de> wrote:
> >
> >> Hi,
> >>
> >>> On 24.03.2015 at 09:55, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >>>
> >>> You can download it from here:
> >>>
> >>> https://drive.google.com/file/d/0B5Kxacm1mej-MEZubTNYVVJYTFE/view?usp=sharing
> >>>
> >>
> >> looking more closely you correctly replaced the text, but that text was in
> >> there for searching within the PDF as it used text rendering mode 3
> >> (invisible). The 'text' you are still seeing is drawn using vector commands
> >> so it's graphics content.
> >>
> >> BR
> >> Maruan
> >>
> >>
> >>> Best Regards,
> >>>
> >>>
> >>> On Tue, Mar 24, 2015 at 9:48 AM, Maruan Sahyoun <sahyoun@fileaffairs.de> wrote:
> >>>
> >>>>
> >>>>
> >>>>> On 24.03.2015 at 09:40, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> In fact PDFBox calls the operation of transforming "7R %H $SSURYHG" to
> >>>>> "To Be Approved" "encoding". Anyway, whether it's encoding or decoding, I
> >>>>> thought it was easier to transform "7R %H $SSURYHG" to "To Be Approved"
> >>>>> than the opposite (or at least I don't know how to do the opposite). I
> >>>>> spent quite a long time trying to find the character codes for the glyphs
> >>>>> in the currently used font, and found that it's not an easy task. By the
> >>>>> way, if you know how to do that, I'd very much appreciate it, because I
> >>>>> need it for replacing text with other text, and for that the new text
> >>>>> must be encoded the same way as the original!
> >>>>>
> >>>>> Back to the text removal: I am able to find the text and also remove it
> >>>>> by calling reset, as I mentioned in my first email. When I print the
> >>>>> output content I don't find the text anymore, but I still see it when I
> >>>>> open the file. My first assumption was that there must be some way to
> >>>>> remove the text other than the one I am using, and that's what you've
> >>>>> actually confirmed in your reply, so could you please tell me what is
> >>>>> still missing?
> >>>>>
> >>>>
> >>>> Could you upload the PDF with the reset text too?
> >>>>
> >>>> BR
> >>>> Maruan
> >>>>
> >>>>
> >>>>> Thanks and regards,
> >>>>> a7mad
> >>>>>
> >>>>> On Tue, Mar 24, 2015 at 9:22 AM, Maruan Sahyoun <sahyoun@fileaffairs.de> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>>> On 24.03.2015 at 08:14, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Here's how I do it:
> >>>>>>>
> >>>>>>> 1. I use the following method to encode the text:
> >>>>>>>
> >>>>>>> String encode(String text, PDFont font) throws Exception {
> >>>>>>>     StringBuilder builder = new StringBuilder();
> >>>>>>>     byte[] stringBytes = text.getBytes();
> >>>>>>>     int codeLength = 1;
> >>>>>>>     for (int i = 0; i < stringBytes.length; i += codeLength) {
> >>>>>>>         String c = font.encode(stringBytes, i, codeLength);
> >>>>>>>         if (c == null && (i + 1 < stringBytes.length)) {
> >>>>>>>             codeLength++;
> >>>>>>>             c = font.encode(stringBytes, i, codeLength);
> >>>>>>>         }
> >>>>>>>         builder.append(c);
> >>>>>>>     }
> >>>>>>>     return builder.toString();
> >>>>>>> }
> >>>>>>>
> >>>>>>> 2. Iterating through the tokens, I find the text, which is either a
> >>>>>>> COSString ("Tj" operator) or a COSArray ("TJ" operator), then check
> >>>>>>> whether it's the text I want to remove, as follows:
> >>>>>>>
> >>>>>>> if (op.getOperation().equals("Tj")) {
> >>>>>>>     COSString previous = (COSString) tokens.get(j - 1);
> >>>>>>>     String string = previous.getString();
> >>>>>>>     String encodedString = encode(string, font);
> >>>>>>
> >>>>>> that string is already encoded. So you'd need to encode "To Be Approved"
> >>>>>> and compare if that matches the string you are reading from the PDF.
> >>>>>>
> >>>>>>>     if (encodedString.contains("To Be Approved")) {
> >>>>>>>         previous.reset();
> >>>>>>>     }
> >>>>>>> } else if (op.getOperation().equals("TJ")) {
> >>>>>>>     COSArray previous = (COSArray) tokens.get(j - 1);
> >>>>>>>     StringBuilder stringBuilder = new StringBuilder();
> >>>>>>>     for (int k = 0; k < previous.size(); k++) {
> >>>>>>>         Object arrElement = previous.getObject(k);
> >>>>>>>         if (arrElement instanceof COSString) {
> >>>>>>>             COSString cosString = (COSString) arrElement;
> >>>>>>>             stringBuilder.append(cosString.getString());
> >>>>>>>         }
> >>>>>>>     }
> >>>>>>>     String string = stringBuilder.toString();
> >>>>>>>     String encodedString = encode(string, font);
> >>>>>>>     if (encodedString.contains("To Be Approved")) {
> >>>>>>>         previous.clear();
> >>>>>>>     }
> >>>>>>> }
> >>>>>>>
> >>>>>>> Note:
> >>>>>>> In the COSArray case, I first iterate through the whole array to get
> >>>>>>> the whole string before encoding and comparison, and this works.
> >>>>>>>
> >>>>>>> Best Regards,
> >>>>>>> a7mad
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, Mar 23, 2015 at 10:48 PM, Maruan Sahyoun <sahyoun@fileaffairs.de> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> your text is encoded so within the show text operator Tj the string is
> >>>>>>>>
> >>>>>>>> 7R %H $SSURYHG
> >>>>>>>>
> >>>>>>>> You wrote that you encode your string to find it - what do you get?
> >>>>>>>>
> >>>>>>>> BR
> >>>>>>>> Maruan
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>> On 23.03.2015 at 22:01, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> Hi Maruan,
> >>>>>>>>>
> >>>>>>>>> Here's a link from where you can download the PDF.
> >>>>>>>>>
> >>>>>>>>> https://drive.google.com/file/d/0B5Kxacm1mej-bm82NzNvUXFPSmMtUjc0ZFVjVVlrODZnRzdn/view?usp=sharing
> >>>>>>>>>
> >>>>>>>>> Kind Regards,
> >>>>>>>>> a7mad
> >>>>>>>>>
> >>>>>>>>> On Mon, Mar 23, 2015 at 8:57 PM, Maruan Sahyoun <sahyoun@fileaffairs.de> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> you need to upload it to a public location as the mailing list
> >>>>>>>>>> doesn't support attachments.
> >>>>>>>>>>
> >>>>>>>>>> BR
> >>>>>>>>>> Maruan
> >>>>>>>>>>
> >>>>>>>>>>> On 23.03.2015 at 19:18, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Dear Maruan,
> >>>>>>>>>>>
> >>>>>>>>>>> Thank you very much for the information. Please find attached the
> >>>>>>>>>>> PDF to reproduce the problem.
> >>>>>>>>>>> The text to remove is: "To Be Approved". The text has a multi-byte
> >>>>>>>>>>> encoding, so I first encode it in order to find it and then remove it.
> >>>>>>>>>>>
> >>>>>>>>>>> Best Regards,
> >>>>>>>>>>> a7mad
> >>>>>>>>>>>
> >>>>>>>>>>>> On Mon, Mar 23, 2015 at 4:13 PM, Maruan Sahyoun <sahyoun@fileaffairs.de> wrote:
> >>>>>>>>>>>> Dear a7mad,
> >>>>>>>>>>>>
> >>>>>>>>>>>> removing text from a PDF is not an easy task as
> >>>>>>>>>>>> - text which visually appears as a single item might consist of
> >>>>>>>>>>>>   individual parts within the PDF itself, e.g. each character or
> >>>>>>>>>>>>   group of characters is placed individually in a different COSString
> >>>>>>>>>>>> - text might be drawn using graphics commands
> >>>>>>>>>>>> - text can appear within different parts of the PDF (e.g. the text
> >>>>>>>>>>>>   might be the content of a form field AND of the annotation
> >>>>>>>>>>>>   representing the form field visually)
> >>>>>>>>>>>> - you need to look up the encoding information to get from the
> >>>>>>>>>>>>   characters in the PDF "string" to the ones you are looking for
> >>>>>>>>>>>> ….
> >>>>>>>>>>>>
> >>>>>>>>>>>> If you can post a specific PDF to a public location and describe in
> >>>>>>>>>>>> detail which string should have been replaced but wasn't, I will be
> >>>>>>>>>>>> able to tell you why that might have happened.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Maruan
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On 23.03.2015 at 15:03, a7med shre3y <a7med.shre3y@gmail.com> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hi all,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I am currently facing a strange problem removing text from some PDFs.
> >>>>>>>>>>>>> My program is able to find the text and "remove it" by calling the
> >>>>>>>>>>>>> COSString.reset() method.
> >>>>>>>>>>>>> The problem is that when I open the output PDF file, I still see the
> >>>>>>>>>>>>> text, but it is not selectable (when I try to highlight it with the
> >>>>>>>>>>>>> mouse to copy it, nothing is selected). When I print the content
> >>>>>>>>>>>>> (tokens) of the output file, I DO NOT find the text at all!!
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I am currently stuck in the PDF 1.5 specification and really running
> >>>>>>>>>>>>> out of time.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'd very much appreciate any help or any idea on what's going on.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Notes:
> >>>>>>>>>>>>> 1. I use PDFBox 1.7.1
> >>>>>>>>>>>>> 2. This problem does not occur with all PDFs; only some PDFs cause it.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Thank you very much.
> >>>>>>>>>>>>> a7mad
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>
> ---------------------------------------------------------------------
> >>>>>>>>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> >>>>>>>>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>
> >>
> >>
> >>
>
>
>
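
Maruan's remark that the Tj string "is already encoded" can be illustrated with a small sketch in plain Java. It assumes a hypothetical character-to-code table built only from the two sample strings quoted in this thread ("To Be Approved" and "7R %H $SSURYHG"); the class and method names are made up for illustration and are not PDFBox API. A real program would have to take the mapping from the font itself (its encoding or ToUnicode information). In this one sample the non-space codes happen to sit 29 below the ASCII values, and the space is shown unchanged only because that is how the list displayed it; neither observation should be relied on for other fonts.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch only: encode the search text into the font's code space, then
 * compare it with the raw string read from the content stream.
 * The table is hypothetical and derived from the sample pair in this thread.
 */
final class EncodedTextMatcher {

    // Hypothetical table: readable character -> code as it appears in the PDF string.
    private final Map<Character, Character> charToCode = new HashMap<>();

    /** Builds the table from a decoded/encoded sample pair of equal length. */
    EncodedTextMatcher(String decodedSample, String encodedSample) {
        if (decodedSample.length() != encodedSample.length()) {
            throw new IllegalArgumentException("samples must have the same length");
        }
        for (int i = 0; i < decodedSample.length(); i++) {
            charToCode.put(decodedSample.charAt(i), encodedSample.charAt(i));
        }
    }

    /** Encodes readable text; returns null if a character has no known code. */
    String encode(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (char c : text.toCharArray()) {
            Character code = charToCode.get(c);
            if (code == null) {
                return null;
            }
            out.append(code.charValue());
        }
        return out.toString();
    }

    /** True if the raw content-stream string contains the encoded search text. */
    boolean rawStringContains(String rawPdfString, String searchText) {
        String encoded = encode(searchText);
        return encoded != null && rawPdfString.contains(encoded);
    }

    public static void main(String[] args) {
        EncodedTextMatcher m =
                new EncodedTextMatcher("To Be Approved", "7R %H $SSURYHG");
        System.out.println(m.encode("Approved"));
        System.out.println(m.rawStringContains("7R %H $SSURYHG", "To Be Approved"));
    }
}
```

The comparison runs in the direction Maruan describes: the search text is encoded and matched against the raw string, rather than running the raw string through encode() a second time.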
>
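
On the open question of what the "drawing commands" are: in the PDF specification, text is shown with the operators Tj, TJ, ' and ", while vector graphics are built with path construction operators (m, l, c, v, y, h, re) and painted with path-painting operators (S, s, f, F, f*, B, B*, b, b*, n). The following is a minimal sketch in plain Java that sorts operator names into those groups. The class, enum and method names are hypothetical, and the input is assumed to be a list of operator names already extracted from a content stream. It does not identify which paths form a given glyph, which is the hard part Maruan describes, and it ignores paths hidden inside form XObjects invoked with Do.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch only: classify content-stream operator names by what they do. */
final class OperatorKinds {

    enum Kind { TEXT_SHOWING, PATH_CONSTRUCTION, PATH_PAINTING, OTHER }

    // Operator names as listed in the PDF specification.
    private static final Set<String> TEXT_SHOWING_OPS =
            new HashSet<>(Arrays.asList("Tj", "TJ", "'", "\""));
    private static final Set<String> PATH_CONSTRUCTION_OPS =
            new HashSet<>(Arrays.asList("m", "l", "c", "v", "y", "h", "re"));
    // "n" ends a path without filling or stroking it.
    private static final Set<String> PATH_PAINTING_OPS =
            new HashSet<>(Arrays.asList("S", "s", "f", "F", "f*", "B", "B*", "b", "b*", "n"));

    static Kind classify(String operator) {
        if (TEXT_SHOWING_OPS.contains(operator)) {
            return Kind.TEXT_SHOWING;
        }
        if (PATH_CONSTRUCTION_OPS.contains(operator)) {
            return Kind.PATH_CONSTRUCTION;
        }
        if (PATH_PAINTING_OPS.contains(operator)) {
            return Kind.PATH_PAINTING;
        }
        return Kind.OTHER;
    }

    /** Counts operators that construct or paint paths (candidates for vector-drawn glyphs). */
    static long countPathOperators(List<String> operators) {
        return operators.stream()
                .map(OperatorKinds::classify)
                .filter(k -> k == Kind.PATH_CONSTRUCTION || k == Kind.PATH_PAINTING)
                .count();
    }

    public static void main(String[] args) {
        // Invisible text (BT ... Tj ... ET) followed by a filled outline.
        List<String> sample =
                Arrays.asList("BT", "Tf", "Tr", "Tj", "ET", "m", "l", "c", "h", "f");
        for (String op : sample) {
            System.out.println(op + " -> " + classify(op));
        }
        System.out.println("path operators: " + countPathOperators(sample));
    }
}
```

A page like the one discussed here, with searchable text in rendering mode 3 plus visible glyph outlines, would show both groups: clearing the Tj/TJ operands removes only the first, which matches the behaviour reported in the thread.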
