pdfbox-users mailing list archives

From Mohit Srivastava <mohit.srivast...@magicsw.com>
Subject Text not removed from Tokens
Date Mon, 13 Apr 2015 06:16:48 GMT
Hi,

We are trying to remove the text content of PDF pages by dropping the
tokens that belong to the "TJ" and "Tj" operators. This works for most of
the text, but a few tokens contain nested text that is not removed. While
investigating, we observed that the text that is not removed is associated
with a PDFOperator{do} preceded by a COSName{FM*}.


Please help us with this. The code we use is below:

code:

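// Imports needed by this fragment (PDFBox 1.x package names):
// import java.util.ArrayList;
// import java.util.List;
// import org.apache.pdfbox.pdfparser.PDFStreamParser;
// import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
// import org.apache.pdfbox.pdmodel.PDDocument;
// import org.apache.pdfbox.pdmodel.PDPage;
// import org.apache.pdfbox.pdmodel.common.PDStream;
// import org.apache.pdfbox.util.PDFOperator;

PDDocument document = null;
try
{
    document = PDDocument.load("D:\\BiologyShort.pdf");
    if( document.isEncrypted() )
    {
        System.err.println( "Error: Encrypted documents are not supported for this example." );
        System.exit( 1 );
    }
    List allPages = document.getDocumentCatalog().getAllPages();
    for( int i=0; i<allPages.size(); i++ )
    {
        // Parse the page's own content stream into a flat token list.
        PDPage page = (PDPage)allPages.get( i );
        PDFStreamParser parser = new PDFStreamParser(page.getContents());
        parser.parse();
        List tokens = parser.getTokens();

        List newTokens = new ArrayList();
        for( int j=0; j<tokens.size(); j++)
        {
            Object token = tokens.get( j );
            if( token instanceof PDFOperator )
            {
                PDFOperator op = (PDFOperator)token;
                if( op.getOperation().equals( "Tj" ) || op.getOperation().equals( "TJ" ))
                {
                    // Remove the one argument to this operator and skip the operator itself.
                    newTokens.remove( newTokens.size() -1 );
                    continue;
                }
            }
            newTokens.add( token );
        }

        // Write the filtered tokens to a new stream and attach it to the page.
        PDStream newContents = new PDStream( document );
        ContentStreamWriter writer = new ContentStreamWriter( newContents.createOutputStream() );
        writer.writeTokens( newTokens );
        newContents.addCompression();
        page.setContents( newContents );
    }
    document.save("D:\\BiologyShortNew.pdf" );
}
catch(Exception e)
{
    // Report the failure instead of silently swallowing it.
    e.printStackTrace();
}
finally
{
    if( document != null )
    {
        document.close();
    }
}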
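
To make the behaviour described above easier to reproduce without a PDF at
hand, here is a minimal sketch of the same filtering loop on a plain token
list. It does not use PDFBox: the Op class is a hypothetical stand-in for
PDFOperator, and the operand "/Fm0" is an assumed example name. It only
illustrates that the loop touches Tj/TJ and leaves a name followed by the
Do operator as it is. In PDF, Do paints an XObject, and a form XObject has
its own content stream, so any text inside it never appears among the
page's tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenFilterSketch {
    /** Hypothetical stand-in for PDFBox's PDFOperator: only holds the operator name. */
    static final class Op {
        final String name;
        Op(String name) { this.name = name; }
        @Override public String toString() { return "Op{" + name + "}"; }
    }

    /** Drops every Tj/TJ operator together with the single operand before it. */
    static List<Object> stripTextShowing(List<Object> tokens) {
        List<Object> out = new ArrayList<>();
        for (Object token : tokens) {
            if (token instanceof Op) {
                String name = ((Op) token).name;
                if (name.equals("Tj") || name.equals("TJ")) {
                    // The operand was added in the previous iteration; take it back out.
                    if (!out.isEmpty()) {
                        out.remove(out.size() - 1);
                    }
                    continue;
                }
            }
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed page-level tokens: one text-showing operation, then a form XObject call.
        List<Object> page = Arrays.asList(
            new Op("BT"), "(Hello)", new Op("Tj"), new Op("ET"),
            "/Fm0", new Op("Do"));
        // The string and Tj are gone; "/Fm0" and Do pass through unchanged.
        System.out.println(stripTextShowing(page));
    }
}
```

The "/Fm0 Do" pair survives because the filter never looks inside the
object that the name refers to. That matches what we see: the remaining
text seems to live in streams other than the page's own content stream.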

regards,
Mohit
