pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Text not removed from Tokens
Date Mon, 13 Apr 2015 06:35:50 GMT
You have somewhat answered your own question: the FM* objects are 
XForms (form XObjects). These are one kind of XObject, and you find 
them by going through the page resources. Each XForm has its own 
content stream, which you can alter in the same way as you did with 
the page content stream.

Btw, XForms can in turn have their own resources, and those can again 
contain XForms...
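
To make the recursion concrete, here is a minimal, library-free sketch. 
It does not use the PDFBox API: Content is a hypothetical stand-in for 
"something with a content stream and resources" (a page or an XForm), 
and tokens are modelled as plain strings. In PDFBox 1.8 the child forms 
would be the PDXObjectForm entries among the XObjects of the resources. 
The visited set is an added precaution against resources that refer 
back to a form already seen.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class NestedFormSketch {

    /** Hypothetical stand-in for a page or an XForm: tokens plus named child forms. */
    static class Content {
        List<String> tokens = new ArrayList<String>();
        Map<String, Content> forms = new LinkedHashMap<String, Content>();

        Content(String... tokens) {
            this.tokens.addAll(Arrays.asList(tokens));
        }
    }

    /** Drops every Tj/TJ operator together with the single operand before it. */
    static List<String> stripText(List<String> tokens) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            if (token.equals("Tj") || token.equals("TJ")) {
                if (!kept.isEmpty()) {
                    kept.remove(kept.size() - 1); // the operand added just before
                }
                continue; // skip the operator itself
            }
            kept.add(token);
        }
        return kept;
    }

    /** Filters this content stream, then descends into the forms in its resources. */
    static void stripRecursively(Content content, Set<Content> visited) {
        if (!visited.add(content)) {
            return; // already processed, avoids endless recursion
        }
        content.tokens = stripText(content.tokens);
        for (Content child : content.forms.values()) {
            stripRecursively(child, visited);
        }
    }

    public static void main(String[] args) {
        Content inner = new Content("BT", "(nested)", "Tj", "ET");
        Content form = new Content("BT", "(form text)", "Tj", "ET", "/FM1", "Do");
        form.forms.put("FM1", inner);
        Content page = new Content("BT", "(page text)", "Tj", "ET", "/FM0", "Do");
        page.forms.put("FM0", form);

        Set<Content> visited =
                Collections.newSetFromMap(new IdentityHashMap<Content, Boolean>());
        stripRecursively(page, visited);

        System.out.println(page.tokens);  // [BT, ET, /FM0, Do]
        System.out.println(form.tokens);  // [BT, ET, /FM1, Do]
        System.out.println(inner.tokens); // [BT, ET]
    }
}
```

Note that the Do operators stay in place; only the text-showing 
operators inside each content stream are removed, at every nesting 
level.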

(This is what I wrote for you in PDFBOX-2754. However, I don't see any 
attempt in your code to go through the resources to identify XForms?!)

Tilman


On 13.04.2015 at 08:16, Mohit Srivastava wrote:
> Hi,
>
> We are trying to remove the text content of PDF pages by dropping the
> "TJ" and "Tj" operators and their operands from the token stream. This
> works for most of the text, but a few tokens contain nested text that is
> not removed. While investigating, we observed that the text that is not
> removed is associated with a PDFOperator{Do} preceded by a COSName{FM*}.
>
>
> Please help us with this. Below is the code we used:
>
> code:
>
> PDDocument document = null;
> try
> {
>     document = PDDocument.load( "D:\\BiologyShort.pdf" );
>     if( document.isEncrypted() )
>     {
>         System.err.println( "Error: Encrypted documents are not supported for this example." );
>         System.exit( 1 );
>     }
>     List allPages = document.getDocumentCatalog().getAllPages();
>     for( int i=0; i<allPages.size(); i++ )
>     {
>         PDPage page = (PDPage)allPages.get( i );
>         // parse the page content stream into operands and operators
>         PDFStreamParser parser = new PDFStreamParser( page.getContents() );
>         parser.parse();
>         List tokens = parser.getTokens();
>
>         List newTokens = new ArrayList();
>         for( int j=0; j<tokens.size(); j++ )
>         {
>             Object token = tokens.get( j );
>             if( token instanceof PDFOperator )
>             {
>                 PDFOperator op = (PDFOperator)token;
>                 if( op.getOperation().equals( "Tj" ) || op.getOperation().equals( "TJ" ) )
>                 {
>                     // drop the operator and remove its one operand, which was added just before
>                     newTokens.remove( newTokens.size() - 1 );
>                     continue;
>                 }
>             }
>             newTokens.add( token );
>         }
>         // write the filtered tokens to a new stream and use it as the page content
>         PDStream newContents = new PDStream( document );
>         ContentStreamWriter writer = new ContentStreamWriter( newContents.createOutputStream() );
>         writer.writeTokens( newTokens );
>         newContents.addCompression();
>         page.setContents( newContents );
>     }
>     document.save( "D:\\BiologyShortNew.pdf" );
> }
> catch( Exception e )
> {
>     // report the failure instead of silently swallowing it
>     e.printStackTrace();
> }
> finally
> {
>     if( document != null )
>     {
>         document.close();
>     }
> }
>
> regards,
> Mohit
>



