james-mime4j-dev mailing list archives

From "Frank Fodera (JIRA)" <mime4j-...@james.apache.org>
Subject [jira] [Commented] (MIME4J-281) A Base64 stream which contains padding on each line only decodes the first line
Date Tue, 03 Jul 2018 10:09:00 GMT

    [ https://issues.apache.org/jira/browse/MIME4J-281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16531156#comment-16531156 ]

Frank Fodera commented on MIME4J-281:
-------------------------------------

Hi Tellier,

Thank you for the suggestion. Allowing relaxed base64 parsing as a config option does sound
like a good idea, and I believe it would satisfy this use case.

Also, thank you for double-checking whether the email contained sensitive information. It
did contain some domains, but no other sensitive information. The PDF was public, so there
was no concern there. I scrubbed the mbox file to remove any domain-specific information
from it.

-Frank

> A Base64 stream which contains padding on each line only decodes the first line
> -------------------------------------------------------------------------------
>
>                 Key: MIME4J-281
>                 URL: https://issues.apache.org/jira/browse/MIME4J-281
>             Project: James Mime4j
>          Issue Type: Bug
>    Affects Versions: 0.8.1
>            Reporter: Frank Fodera
>            Priority: Major
>         Attachments: 09-17-10.452-6FB06D6C-637-105@superman.info-mbox, As_Cool_as_I_Am_(film).pdf, base64File.txt
>
>
> *Summary*
>  We use Tika 1.18, which includes James Mime4j version 0.8.1, to parse and extract emails.
> One of our customers attempted to parse an mbox file that contained an email with a
> base64-encoded PDF attachment. On opening the mbox file, we noticed that the attached PDF
> was encoded so that each line is 80 characters long and padded with ==. We can't change
> how they encoded it, and we don't know what tool they used to do so. Later, parsing the
> extracted PDF fails because the PDF was only partially extracted and is not a valid PDF.
> It appears that MimeEntity (decodeStream method) determines that the InputStream is
> Base64 encoded and wraps the LineReaderInputStreamAdaptor in a Base64InputStream. When the
> stream is later read, the read0 method simply checks for a BASE64_PAD and treats it as EOF,
> even though additional content remains to be decoded.
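The effect described above can be illustrated without Mime4j. What follows is a minimal sketch built only on `java.util.Base64`, not the Mime4j implementation: `decodeStopAtFirstPad` and `decodeRelaxed` are hypothetical helpers. The first mimics a decoder that treats the first `=` as end of data, and the second decodes each line on its own, which assumes every line is a complete, separately padded Base64 unit.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class PaddedLinesSketch {

    // Hypothetical helper: mimics a decoder that treats the first '=' as end of data.
    static byte[] decodeStopAtFirstPad(String encoded) {
        int pad = encoded.indexOf('=');
        String head = pad < 0 ? encoded : encoded.substring(0, pad);
        // Drop line breaks, then restore padding to a multiple of four characters.
        head = head.replaceAll("\\s", "");
        while (head.length() % 4 != 0) {
            head += "=";
        }
        return Base64.getDecoder().decode(head);
    }

    // Hypothetical helper: decodes every non-empty line separately,
    // so padding at the end of each line is accepted.
    static byte[] decodeRelaxed(String encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String line : encoded.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                byte[] chunk = Base64.getDecoder().decode(trimmed);
                out.write(chunk, 0, chunk.length);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Two lines, each padded separately ("A" and "B" encoded one at a time).
        String input = "QQ==\r\nQg==\r\n";
        System.out.println(new String(decodeStopAtFirstPad(input), StandardCharsets.US_ASCII)); // prints "A"
        System.out.println(new String(decodeRelaxed(input), StandardCharsets.US_ASCII));        // prints "AB"
    }
}
```

With the two-line input, stopping at the first pad yields only the first line's byte, while the per-line variant recovers both. A relaxed mode inside a streaming decoder would need to handle more cases than this sketch does, for example lines that are not self-contained Base64 units.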
>  
> *Code to Help Reproduce:*
> {noformat}
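> import java.io.ByteArrayOutputStream;
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.InputStream;
>
> public static void main(String[] args) throws Exception {
>     File initialFile = new File("/path/to/file/base64File.txt");
>     InputStream inputStream = new FileInputStream(initialFile);
>     // Wrap the raw stream in a line reader, then in the Base64 decoder.
>     org.apache.james.mime4j.io.LineReaderInputStreamAdaptor lineReaderInputStream =
>             new org.apache.james.mime4j.io.LineReaderInputStreamAdaptor(inputStream);
>     InputStream base64InputStream =
>             new org.apache.james.mime4j.codec.Base64InputStream(lineReaderInputStream);
>     ByteArrayOutputStream bos = new ByteArrayOutputStream();
>     // Copy the decoded bytes; bos ends up holding only the first line's content.
>     org.apache.tika.io.IOUtils.copy(base64InputStream, bos);
> }{noformat}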
> Running the code above shows that only the first line of the encoded PDF (contained in
> base64File.txt) is decoded, not the entire PDF.
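For reproducing this without the attached file, an input of the same shape can be synthesized. The sketch below is an assumption about what such data looks like, since the encoder the customer used is unknown: it encodes every 58-byte slice separately, which gives 80-character lines ending in `==`. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Base64;

public class PaddedBase64Generator {

    // 58 bytes encode to 20 groups of 4 characters (80 characters);
    // the last group carries a single byte and therefore ends in "==".
    static final int BYTES_PER_LINE = 58;

    // Hypothetical helper: encodes each 58-byte slice on its own, one padded line per slice.
    static String encodeWithPerLinePadding(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += BYTES_PER_LINE) {
            int end = Math.min(data.length, off + BYTES_PER_LINE);
            byte[] slice = Arrays.copyOfRange(data, off, end);
            sb.append(Base64.getEncoder().encodeToString(slice)).append("\r\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = new byte[116]; // two full lines of zero bytes
        String encoded = encodeWithPerLinePadding(data);
        for (String line : encoded.split("\r\n")) {
            // Each full line is 80 characters long and ends with "==".
            System.out.println(line.length() + " " + line.endsWith("=="));
        }
    }
}
```

The generated text can be written to a file and passed to the reproduction code in place of base64File.txt.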
>  
> *Extracting the MBOX via Tika 1.18*
> {noformat}
> [user]$ java -jar tika-app-1.18.jar -m -J ~/Downloads/09-17-10.452-6FB06D6C-637-105@bradford.info-mbox | python -m json.tool
> Jun 25, 2018 11:55:55 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
> WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> for optional dependencies.
>  
> Jun 25, 2018 11:55:55 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
> WARNING: org.xerial's sqlite-jdbc is not loaded.
> Please provide the jar on your classpath to parse sqlite files.
> See tika-parsers/pom.xml for the correct version.
> [
>     {
>         "Content-Encoding": "windows-1252",
>         "Content-Length": "366503",
>         "Content-Type": "application/mbox",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.mbox.MboxParser"
>         ],
>         "X-TIKA:parse_time_millis": "199",
>         "resourceName": "09-17-10.452-6FB06D6C-637-105@bradford.info-mbox"
>     },
>     {
>         "Content-Disposition": "attachment; filename=\"/home/test/test/attachments/As_Cool_as_I_Am_(film).pdf\"",
>         "Content-Type": "application/pdf",
>         "Multipart-Boundary": "===============6812308677685932777==",
>         "Multipart-Subtype": "mixed",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.pdf.PDFParser"
>         ],
>         "X-TIKA:EXCEPTION:embedded_exception": "org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@45f45fa1\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)\n\tat org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)\n\tat org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:318)\n\tat org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)\n\tat org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)\n\tat org.apache.tika.parser.mail.MailContentHandler.handleEmbedded(MailContentHandler.java:283)\n\tat org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:228)\n\tat org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)\n\tat org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:100)\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)\n\tat org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)\n\tat org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:318)\n\tat org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)\n\tat org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)\n\tat org.apache.tika.parser.mbox.MboxParser.parse(MboxParser.java:135)\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)\n\tat org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:159)\n\tat org.apache.tika.cli.TikaCLI.handleRecursiveJson(TikaCLI.java:507)\n\tat org.apache.tika.cli.TikaCLI.process(TikaCLI.java:481)\n\tat org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)\nCaused by: java.io.IOException: Missing root object specification in trailer.\n\tat org.apache.pdfbox.pdfparser.COSParser.parseTrailerValuesDynamically(COSParser.java:2727)\n\tat org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:193)\n\tat org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:240)\n\tat org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1144)\n\tat org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1117)\n\tat org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)\n\tat org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\t... 25 more\n",
>         "X-TIKA:embedded_resource_path": "/embedded-1/As_Cool_as_I_Am_(film).pdf",
>         "X-TIKA:parse_time_millis": "52",
>         "embeddedResourceType": "ATTACHMENT",
>         "resourceName": "/home/test/test/attachments/As_Cool_as_I_Am_(film).pdf"
>     },
>     {
>         "Author": [
>             "pcox@bradford.info",
>             "pcox@bradford.info"
>         ],
>         "Content-Type": "message/rfc822",
>         "Content-Type-Override": "message/rfc822",
>         "MboxParser-content-disposition": "attachment;",
>         "MboxParser-content-transfer-encoding": [
>             "7bit",
>             "base64"
>         ],
>         "MboxParser-from": "pcox@bradford.info Wed May 16 09:17:10 2018",
>         "MboxParser-mime-version": [
>             "1.0",
>             "1.0"
>         ],
>         "MboxParser-return-path": "<pcox@bradford.info> filename=\"/home/test/test/attachments/As_Cool_as_I_Am_(film).pdf\"",
>         "Message-From": "pcox@bradford.info",
>         "Message-Recipient-Address": "jonesbrenda@miller.com",
>         "Message-To": [
>             "jonesbrenda@miller.com",
>             "jonesbrenda@miller.com"
>         ],
>         "Message:From-Email": "pcox@bradford.info",
>         "Message:Raw-Header:MIME-Version": "1.0",
>         "Message:Raw-Header:Return-Path": "<pcox@bradford.info>",
>         "Multipart-Boundary": "===============6812308677685932777==",
>         "Multipart-Subtype": "mixed",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.mail.RFC822Parser"
>         ],
>         "X-TIKA:embedded_resource_path": "/embedded-1",
>         "X-TIKA:parse_time_millis": "124",
>         "creator": [
>             "pcox@bradford.info",
>             "pcox@bradford.info"
>         ],
>         "dc:creator": [
>             "pcox@bradford.info",
>             "pcox@bradford.info"
>         ],
>         "dc:format": "application/pdf",
>         "dc:title": "Side question local book claim.",
>         "format": "application/pdf",
>         "meta:author": [
>             "pcox@bradford.info",
>             "pcox@bradford.info"
>         ],
>         "subject": "Side question local book claim."
>     }
> ]{noformat}
>  
> *Attached Files*
>  # The customer's original mbox file (09-17-10.452-6FB06D6C-637-105@bradford.info-mbox)
>  # The base64-encoded PDF in its own file (base64File.txt)
>  # The extracted PDF standalone (As_Cool_as_I_Am_(film).pdf)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
