poi-dev mailing list archives

From Joe Wicentowski <joe...@gmail.com>
Subject Problem extracting date from Outlook 2007 .msg file
Date Thu, 16 Aug 2012 22:52:02 GMT
Hi all,

Hello!  This is my message to the list.  I'm building an application
that relies on Tika to extract text from Outlook 2007 .msg files.
Tika relies on POI's HSMF libraries, which is why I'm writing to this
list about a problem: HSMF is not pulling out the date of many of my
Outlook messages.

For example, when I look at one of my message files (.msg) in Outlook,
it says that the message was sent on "Fri 6/22/2012 8:11 AM", but when
I process the same message with Tika, no date appears in the output [1].

In comparison, I tried using a different tool, ruby-msg
(http://code.google.com/p/ruby-msg/), to process the same message, and
ruby-msg did pull out the date [2].  This experiment shows that the
date *is* in the .msg file, and that Tika is failing to pick it up.

Nick Burch from the Tika mailing list took a close, hands-on look at
my .msg file, determined the cause, and outlined a path to the fix:

> I think I've figured out what's wrong. It looks like Outlook stores
> properties with a fixed size of 0-8 bytes in a different chunk in the file,
> which we weren't processing.
> If you wanted to tackle it, that'd be great! You'll want to take a look at
> PropertiesChunk, and fill in the TODOs for readProperties and
> writeProperties, then add unit tests. See:
>  http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hsmf/datatypes/PropertiesChunk.java?view=markup
> When that's all done and working, then
> the final step is to update MAPIMessage to read some of the values as needed
> out of the properties.
> The info I've been working with comes from this blog post:
> http://blogs.msdn.com/b/openspecification/archive/2009/11/06/msg-file-format-part-1.aspx
> (That links into suitable bits of the public documentation)
> I suspect it's under a day's work. I've put in place the basics, just needs
> someone to flesh it out.
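
To make Nick's description concrete, here is a minimal Java sketch of how
a fixed-size date property might be read. It is not POI code and does not
fill in the PropertiesChunk TODOs. It assumes the layout given in the
public MSG documentation: each fixed-size entry in the properties stream
is 16 bytes, little-endian (a 4-byte property tag with the type in the
low 16 bits and the ID in the high 16 bits, 4 bytes of flags, then 8
bytes of value), and it assumes the stream header has already been
skipped by the caller. The class and method names are hypothetical. The
constants 0x0039 (client submit time) and 0x0040 (a 64-bit FILETIME
value) are the standard MAPI values.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;

// Hypothetical helper for illustration only; not part of POI.
public class FixedPropertySketch {
    static final int PT_SYSTIME = 0x0040;            // MAPI type: 64-bit FILETIME
    static final int PR_CLIENT_SUBMIT_TIME = 0x0039; // MAPI property ID: time the message was sent
    // Seconds between 1601-01-01 (FILETIME epoch) and 1970-01-01 (Unix epoch).
    static final long FILETIME_EPOCH_OFFSET_SECONDS = 11_644_473_600L;

    // FILETIME counts 100-nanosecond ticks since 1601-01-01 UTC.
    static Instant filetimeToInstant(long filetime) {
        long seconds = Math.floorDiv(filetime, 10_000_000L) - FILETIME_EPOCH_OFFSET_SECONDS;
        long nanos = Math.floorMod(filetime, 10_000_000L) * 100L;
        return Instant.ofEpochSecond(seconds, nanos);
    }

    // Scans 16-byte entries (header assumed already skipped) and returns
    // the submit time, or null if no such entry is present.
    static Instant findSubmitTime(byte[] entries) {
        ByteBuffer buf = ByteBuffer.wrap(entries).order(ByteOrder.LITTLE_ENDIAN);
        while (buf.remaining() >= 16) {
            int tag = buf.getInt();     // low 16 bits: type, high 16 bits: property ID
            buf.getInt();               // flags, ignored in this sketch
            long value = buf.getLong(); // fixed-size value, padded to 8 bytes
            int type = tag & 0xFFFF;
            int id = tag >>> 16;
            if (id == PR_CLIENT_SUBMIT_TIME && type == PT_SYSTIME) {
                return filetimeToInstant(value);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Build one synthetic entry for the time ruby-msg reported.
        Instant sent = Instant.parse("2012-06-22T12:11:00Z");
        long filetime = (sent.getEpochSecond() + FILETIME_EPOCH_OFFSET_SECONDS) * 10_000_000L;
        byte[] entry = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN)
                .putInt((PR_CLIENT_SUBMIT_TIME << 16) | PT_SYSTIME)
                .putInt(0x06)           // arbitrary flags value for the example
                .putLong(filetime)
                .array();
        System.out.println(findSubmitTime(entry));
    }
}
```

The real fix would live in PropertiesChunk and MAPIMessage as Nick
describes; the sketch only shows the byte-level decoding involved.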

While Nick kindly tracked down the cause, unfortunately I lack the
Java chops to complete the solution.

Would anyone here be kind enough to assist me with this?

I'm happy to test any attempted fixes, and I'm happy to provide more
info, like sample Outlook files (.msg files).  My hope is that this
fix will allow POI to "just work" for more users who are evaluating

Thank you in advance,

[1] Tika output showing no date, retrieved via the following command:

   java -jar tika-app-1.1.jar "Inquiry.msg" > inquiry.html

<html xmlns="http://www.w3.org/1999/xhtml">
        <meta name="Message-Bcc" content="" />
        <meta name="subject" content="Inquiry" />
        <meta name="Content-Length" content="40960" />
        <meta name="Message-Recipient-Address" content="snip@gmail.com" />
        <meta name="Message-From" content="History Mailbox" />
        <meta name="Author" content="History Mailbox" />
        <meta name="Message-To" content="'Snip'" />
        <meta name="Message-Cc" content="" />
        <meta name="Content-Type" content="application/vnd.ms-outlook" />
        <meta name="resourceName" content="RE  Inquiry.msg" />
        <h1>RE: Inquiry</h1>
            <dd>History Mailbox</dd>
        <p>Dear Snip</p>

[2] The ruby-msg output -- notice the "Date:" line:

From: "History Mailbox" <removed-address@removed.com>
To: "Snip" <snip@gmail.com>
Subject: RE: Inquiry
Date: Fri, 22 Jun 2012 12:11:00 -0000
Message-ID: <000807F9A285794EAAD13EC6EAE33A760117AE0E237B@PASA1MB01.pace.unc>
In-Reply-To: <CAJ4nNe1FPo7Q=10dbK8sdzPRaRzYKJV6SKV3nyg5L2Li13b+og@mail.gmail.com>
Priority: 0
Thread-Topic: Inquiry
Content-Type: multipart/alternative;
