pdfbox-users mailing list archives

From Kodjo Afriyie - iSite Eng <kodjo.afriyi...@bbc.co.uk>
Subject RE: Unable to remove the virus associated with the file
Date Tue, 23 Jul 2019 08:00:00 GMT
Hi Tilman,

Thank you for the quick response. 
Below is the sample code I am using to strip out all the JavaScript.

I was surprised that this exploit managed to circumvent the process. The entry point is
sanitize(PDDocument pdfDoc).

I have been having the same problem: the virus scanner is preventing me from viewing the
file. I will do some more research to understand exactly where the JavaScript or the
malicious code is, so that I can remove it.

/**
 * The following code was taken from here:
 * https://github.com/mjclemente/pdfbox.cfc
 */
public class PdfSanitizer {

    /**
     * https://stackoverflow.com/questions/14454387/pdfbox-how-to-flatten-a-pdf-form#19723539
     * @hint Flattens any forms on the pdf
     * Note that data in XFA forms is not visible after this process. Chrome/Firefox/Safari/Preview
     * no longer support XFA PDFs; the format seems to be on its way out and is only supported by
     * Adobe (via Acrobat) and IE. Adobe ColdFusion does not allow cfpdf's 'sanitize' action on PDFs
     * with XFA content.
     */
    protected void flatten(PDDocument pdfDoc) throws PdfSanitizationException {

        PDAcroForm acroForm = pdfDoc.getDocumentCatalog().getAcroForm();
        if ( acroForm != null ) {
            try {
                acroForm.flatten();
            } catch (IOException e) {
                throw new PdfSanitizationException(e);
            }
        }

    }

    /**
     * https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/pdmodel/interactive/annotation/PDAnnotation.html
     * @hint returns all annotations within the pdf as a list; the type of each object returned
     * is PDAnnotation, so you'll need to look at the javadocs for that to see what methods are available
     */
    protected List<PDAnnotation> listAnnotations(PDDocument pdfDoc) throws PdfSanitizationException {
        List<PDAnnotation> annotations = new ArrayList<>();
        PDPageTree pages = pdfDoc.getPages();
        Iterator<PDPage> iterator = pages.iterator();
        while( iterator.hasNext() ) {
            PDPage page = iterator.next();
            try {
                annotations.addAll(page.getAnnotations());
            } catch (IOException e) {
                throw new PdfSanitizationException(e);
            }
        }
        return annotations;
    }

    /**
     * https://stackoverflow.com/questions/32741468/how-to-delete-annotations-in-pdf-file-using-pdfbox
     * https://lists.apache.org/thread.html/d5b5f7a1d07d4eb9c515054ae7e87bdf4aefb3f138b235f82297401d@%3Cusers.pdfbox.apache.org%3E
     * @hint Strips out comments and other annotations
     * Form fields are made visible/usable via annotations (as I understand it); consequently,
     * removing all annotations renders forms, effectively, invisible and unusable, though the
     * markup remains present (visible via the Debugger).
     * The default behavior, therefore, is to leave annotations related to forms present,
     * so that the forms remain functional. While you can remove form annotations by setting
     * preserveForm = false, the better approach is to use flatten().
     * Reminder: Added links are a type of annotation (PDAnnotationLink) so they're removed
     * by this method
     */
    protected void removeAnnotations(PDDocument pdfDoc, boolean preserveForm) throws PdfSanitizationException {
        PDPageTree pages = pdfDoc.getPages();
        Iterator<PDPage> iterator = pages.iterator();

        while( iterator.hasNext() ) {
            PDPage page = iterator.next();

            if ( !preserveForm ) {
                page.setAnnotations(null);
            } else {
                List<PDAnnotation> annotations = new ArrayList<>();

                try {
                    for(PDAnnotation annotation: page.getAnnotations()) {
                        // Constant-first equals() avoids a NullPointerException when the subtype is missing
                        if ("Widget".equals(annotation.getSubtype())) {
                            annotations.add(annotation);
                        }
                    }
                } catch (IOException e) {
                    throw new PdfSanitizationException(e);
                }
                page.setAnnotations( annotations );
            }
        }
    }


    /**
     * https://stackoverflow.com/questions/17019960/extract-embedded-files-from-pdf-using-pdfbox-in-net-application
     * https://github.com/Valuya/fontbox/blob/master/examples/src/main/java/org/apache/pdfbox/examples/pdmodel/EmbeddedFiles.java
     * @hint Removes embedded files
     */
    protected void removeEmbeddedFiles(PDDocument pdfDoc) {

        PDDocumentNameDictionary namesDictionary = new PDDocumentNameDictionary(pdfDoc.getDocumentCatalog());
        PDEmbeddedFilesNameTreeNode efTree = namesDictionary.getEmbeddedFiles();

        if (efTree != null) {
            efTree.getCOSObject().clear();

        }
    }

    /**
     * @hint Attempts to remove all javascript from the pdf. Javascript can appear in a lot
     * of places; this tackles the standard locations. If more are found, they'll be incorporated
     * here.
     */
    protected void removeJavaScript(PDDocument pdfDoc) throws PdfSanitizationException {
        removeEmbeddedJavaScript(pdfDoc);
        removeDocumentJavaScriptActions(pdfDoc);
        removeFormFieldActions(pdfDoc);
        removeLinkActions(pdfDoc);
    }

    /**
     * @hint Removes the javascript embedded in the document itself
     */
    protected void removeEmbeddedJavaScript(PDDocument pdfDoc) {
        PDDocumentNameDictionary namesDictionary = new PDDocumentNameDictionary(pdfDoc.getDocumentCatalog());
        PDJavascriptNameTreeNode embeddedJavaScript = namesDictionary.getJavaScript();
        if (embeddedJavaScript != null) {
            embeddedJavaScript.getCOSObject().clear();
        }
    }

    /**
     * https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/pdmodel/interactive/action/PDDocumentCatalogAdditionalActions.html
     * @hint Removes the actions that can be triggered on open, before close, before/after
     * printing, and before/after saving
     */
    protected void removeDocumentJavaScriptActions(PDDocument pdfDoc) {

        PDDocumentCatalog catalog = pdfDoc.getDocumentCatalog();
        catalog.setOpenAction(null);

        PDDocumentCatalogAdditionalActions actions = catalog.getActions();
        if (actions != null) {
            actions.setDP( null);
            actions.setDS( null);
            actions.setWC( null);
            actions.setWP( null);
            actions.setWS( null);
        }
    }

    /**
     * https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/pdmodel/interactive/action/PDFormFieldAdditionalActions.html
     * There may be another class this needs to address: PDAnnotationAdditionalActions (but
     * I'm not sure exactly how these actions differ from those handled here).
     * For reference and future examination, PDAnnotationAdditionalActions is returned by
     * PDAnnotationWidget (https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/pdmodel/interactive/annotation/PDAnnotationWidget.html),
     * which is the annotation type related to form fields.
     * @hint removes actions embedded in the form fields (triggered onFocus, onBlur, etc.)
     */
    protected void removeFormFieldActions(PDDocument pdfDoc) {
        PDAcroForm acroForm = pdfDoc.getDocumentCatalog().getAcroForm();

        if ( acroForm != null ) {
            Iterator<PDField> iterator = acroForm.getFieldIterator();

            while( iterator.hasNext() ) {
                PDField formField = iterator.next();
                PDFormFieldAdditionalActions formFieldActions = formField.getActions();

                if ( formFieldActions != null ) {
                    formFieldActions.getCOSObject().clear();
                }
            }
        }
    }

    /**
     * https://pdfbox.apache.org/docs/2.0.8/javadocs/org/apache/pdfbox/pdmodel/interactive/annotation/PDAnnotationLink.html
     * @hint removes JavaScript actions attached to links (triggered when the link is activated)
     */
    protected void removeLinkActions(PDDocument pdfDoc) throws PdfSanitizationException {
        PDPageTree pages = pdfDoc.getPages();
        Iterator<PDPage> iterator = pages.iterator();

        while( iterator.hasNext() ) {
            PDPage page = iterator.next();

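            try {
                List<PDAnnotation> annotations = page.getAnnotations();

                for (PDAnnotation annotation : annotations) {
                    // instanceof keeps the cast below safe; the original compared strings with ==,
                    // which checks object identity rather than content
                    if (annotation instanceof PDAnnotationLink) {
                        PDAnnotationLink link = (PDAnnotationLink) annotation;
                        PDAction action = link.getAction();
                        // A link may carry a destination instead of an action, so guard against null
                        if (action != null && "JavaScript".equals(action.getSubType())) {
                            action.getCOSObject().clear();
                        }
                    }
                }
            } catch (IOException e) {
                throw new PdfSanitizationException(e);
            }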
        }
    }

    /**
     * @hint Removes metadata from the document
     *
     * Reference: metadata is stored in two separate locations in a document:
     * The Info (Document Information) - likely a key value pairing.
     * The XMP XML
     * Different PDF readers, when displaying document information, may give preference to
     * different sources. For example, Preview may read the "Author A" from Document Information,
     * while Acrobat may ignore that and read the dc:creator element from the XML and display
     * "Author B".
     * Using the PDFDebugger bundled with PDFBox, via
     * `java -jar pdfbox-app-2.0.11.jar PDFDebugger -viewstructure example.pdf`
     * will provide an accurate view of both Document Information and XML metadata, and so is
     * preferable to pdf readers
     *
     */
    protected void removeMetaData(PDDocument pdfDoc) {
        PDDocumentInformation documentInfo = pdfDoc.getDocumentInformation();
        documentInfo.setAuthor(null);
        documentInfo.setCreationDate(null);
        documentInfo.setCreator(null);
        documentInfo.setKeywords(null);
        documentInfo.setModificationDate(null);
        documentInfo.setProducer(null);
        documentInfo.setSubject(null);
        documentInfo.setTitle(null);
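        documentInfo.setTrapped(null);

        // Also drop the XMP metadata stream, the second location described above
        pdfDoc.getDocumentCatalog().setMetadata(null);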

        /*
        org.apache.xmpbox.XMPMetadata

        var XMPMetadata = createObject( 'java', 'org.apache.xmpbox.XMPMetadata' );
        var metadata = XMPMetadata.createXMPMetadata();

        var serializer = createObject( 'java', 'org.apache.xmpbox.xml.XmpSerializer' );
        var baos = createObject( 'java', 'java.io.ByteArrayOutputStream' ).init();
        serializer.serialize( metadata, baos, true );
        var metadataStream = createObject( 'java', 'org.apache.pdfbox.pdmodel.common.PDMetadata'
).init( variables.pdf );
        metadataStream.importXMPMetadata( baos.toByteArray() );
        variables.pdf.getDocumentCatalog().setMetadata( metadataStream );

        variables.hasMetadata = false;
         */
    }

    /**
     * https://lists.apache.org/thread.html/801ea985610d3adf51cb69103729797af3a745a9364bc3f442f80384@%3Cusers.pdfbox.apache.org%3E
     * @hint If there is an embedded search index, this removes it (at least the instances of
     * embedded search indexes that I've seen)
     */
    protected void removeEmbeddedIndex(PDDocument pdfDoc) {

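        // getDictionaryObject resolves indirect references, so the instanceof check below is reliable
        COSBase pieceInfo = pdfDoc.getDocumentCatalog().getCOSObject()
                .getDictionaryObject(COSName.getPDFName("PieceInfo"));
        if (pieceInfo instanceof COSDictionary) {
            ((COSDictionary) pieceInfo).removeItem(COSName.getPDFName("SearchIndex"));
        }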

    }

    /**
     * @hint Runs all data removal methods on the pdf. As new methods are added to the component,
     * they'll be added here as well. Please be aware that sensitive data may remain in the pdf,
     * even after running this method.
     */
    public void sanitize(PDDocument pdfDoc) throws PdfSanitizationException {
        removeAnnotations(pdfDoc, false);
        removeEmbeddedFiles(pdfDoc);
        removeJavaScript(pdfDoc);
        removeEmbeddedIndex(pdfDoc);
        removeMetaData(pdfDoc);
        flatten(pdfDoc);

    }

}
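
For the research step mentioned at the top (finding where the JavaScript or malicious code actually sits), a rough first pass could be a raw byte scan of the file for the PDF name tokens usually tied to active content. The sketch below is only a triage aid and uses nothing from PDFBox. The class name PdfTokenScan, its method names and the token list are my own assumptions for illustration. It decodes #xx escapes in names (so /J#61vaScript is treated as /JavaScript), but it cannot see anything stored inside compressed streams or object streams, so a count of zero does not prove the file is clean.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of PDFBox: counts suspicious name tokens in raw PDF bytes.
final class PdfTokenScan {

    // Assumed list of names commonly associated with active content; extend as needed.
    static final String[] TOKENS = {
        "/JavaScript", "/JS", "/OpenAction", "/AA", "/Launch", "/EmbeddedFiles"
    };

    // Replaces #xx hex escapes (as allowed in PDF names) with the character they encode.
    static String decodeNameEscapes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '#' && i + 2 < text.length()) {
                int hi = Character.digit(text.charAt(i + 1), 16);
                int lo = Character.digit(text.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.append((char) (hi * 16 + lo));
                    i += 2;
                    continue;
                }
            }
            out.append(c);
        }
        return out.toString();
    }

    // Counts each token, requiring that it is not followed by a letter or digit
    // so that "/JS" does not match a longer name such as "/JSX".
    static Map<String, Integer> scan(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so no bytes are lost
        String text = decodeNameEscapes(new String(pdfBytes, StandardCharsets.ISO_8859_1));
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : TOKENS) {
            int count = 0;
            int from = text.indexOf(token);
            while (from >= 0) {
                int end = from + token.length();
                boolean wholeName = end == text.length()
                        || !Character.isLetterOrDigit(text.charAt(end));
                if (wholeName) {
                    count++;
                }
                from = text.indexOf(token, end);
            }
            counts.put(token, count);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java PdfTokenScan <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        scan(bytes).forEach((token, n) -> System.out.println(token + ": " + n));
    }
}
```

Running this on the file before and after sanitize() should show which tokens survive, which would narrow down the part of the document the scanner is reacting to.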



________________________________________
From: Tilman Hausherr [THausherr@t-online.de]
Sent: 22 July 2019 18:03
To: users@pdfbox.apache.org
Subject: Re: Unable to remove the virus associated with the file

Hi,

I'm unable to download that file... nothing happens. Maybe my antivirus
prevents the download. I suggest you upload it as text (rename it to .txt).

Anyway, you should just tell the sender that it is a virus. Either there
is some suspicious javascript in it, or something else that triggers mayhem.

Tilman

Am 22.07.2019 um 16:22 schrieb Kodjo Afriyie - iSite Eng:
> Hi,
>
> I have been trying to remove a virus that has been detected on a pdf file..
> The link below is the offending file..
>
> https://1drv.ms/u/s!AmNEMt7g6Kbuhh2hqVo8iKKEn9Tj?e=ERJ1uq
>
> Below is the message that is displayed when the file is downloaded onto my computer.
>
> [X]
>
>
> Thanks,
> Kodjo
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org



