poi-dev mailing list archives

From Joe Wicentowski <joe...@gmail.com>
Subject Re: Problem extracting date from Outlook 2007 .msg file
Date Tue, 21 Aug 2012 21:33:55 GMT
Hi all,

I haven't heard from anyone about the question I posed last week
regarding POI/HSMF's problems identifying dates in Outlook .msg files.
Is there a better forum for me to post this?  Should I file a bug?
Ideally, I'd like to find someone who can help complete the fix that
Nick Burch began in POI's SVN trunk.

Thanks for any pointers about the best way to proceed,
Joe

On Thu, Aug 16, 2012 at 6:52 PM, Joe Wicentowski <joewiz@gmail.com> wrote:
> Hi all,
>
> I'm building an application
> that relies on Tika to extract text from Outlook 2007 .msg files.
> Tika relies on POI's HSMF libraries, which is why I'm writing to this
> list about a problem: HSMF is not pulling out the date of many of my
> Outlook messages.
>
> For example, when I look at one of my message files (.msg) in Outlook,
> it says that the message was sent on "Fri 6/22/2012 8:11 AM", but when
> I process the same message with Tika, no date appears in the output
> [1].
>
> In comparison, I tried using a different tool, ruby-msg
> (http://code.google.com/p/ruby-msg/), to process the same message, and
> ruby-msg did pull out the date [2].  This experiment shows that the
> date *is* in the .msg file, and that Tika is failing to pick it up.
>
> Nick Burch from the Tika mailing list took a close, hands-on look at
> my .msg file, determined the cause, and outlined a path to the fix:
>
>> I think I've figured out what's wrong. It looks like Outlook stores
>> properties with a fixed size of 0-8 bytes in a different chunk in the file,
>> which we weren't processing.
>>
>> If you wanted to tackle it, that'd be great! You'll want to take a look at
>> PropertiesChunk, and fill in the TODOs for readProperties and
>> writeProperties, then add unit tests. See:
>>
>>  http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hsmf/datatypes/PropertiesChunk.java?view=markup
>>
>> When that's all done and working, the final step is to update MAPIMessage
>> to read some of the values as needed out of the properties.
>>
>> The info I've been working with comes from this blog post:
>> http://blogs.msdn.com/b/openspecification/archive/2009/11/06/msg-file-format-part-1.aspx
>>
>> (That links into suitable bits of the public documentation)
>>
>> I suspect it's under a day's work. I've put in place the basics, just needs
>> someone to flesh it out.
>
> While Nick kindly tracked down the cause, unfortunately I lack the
> Java chops to complete the solution.
>
> Would anyone here be kind enough to assist me with this?
>
> I'm happy to test any attempted fixes, and I'm happy to provide more
> info, like sample Outlook files (.msg files).  My hope is that this
> fix will allow POI to "just work" for more users who are evaluating
> it.
>
> Thank you in advance,
> Joe
>
>
> [1] Tika output showing no date, retrieved via the following command:
>
>    java -jar tika-app-1.1.jar "Inquiry.msg" > inquiry.html
>
> <html xmlns="http://www.w3.org/1999/xhtml">
>     <head>
>         <meta name="Message-Bcc" content="" />
>         <meta name="subject" content="Inquiry" />
>         <meta name="Content-Length" content="40960" />
>         <meta name="Message-Recipient-Address" content="snip@gmail.com" />
>         <meta name="Message-From" content="History Mailbox" />
>         <meta name="Author" content="History Mailbox" />
>         <meta name="Message-To" content="'Snip'" />
>         <meta name="Message-Cc" content="" />
>         <meta name="Content-Type" content="application/vnd.ms-outlook" />
>         <meta name="resourceName" content="RE  Inquiry.msg" />
>     </head>
>     <body>
>         <h1>RE: Inquiry</h1>
>         <dl>
>             <dt>From</dt>
>             <dd>History Mailbox</dd>
>             <dt>To</dt>
>             <dd>'Snip'</dd>
>             <dt>Recipients</dt>
>             <dd>snip@gmail.com</dd>
>         </dl>
>         <p>Dear Snip</p>
> ...
>
> [2] The ruby-msg output -- notice the "Date:" line:
>
> From: "History Mailbox" <removed-address@removed.com>
> To: "Snip" <snip@gmail.com>
> Subject: RE: Inquiry
> Date: Fri, 22 Jun 2012 12:11:00 -0000
> Message-ID: <000807F9A285794EAAD13EC6EAE33A760117AE0E237B@PASA1MB01.pace.unc>
> In-Reply-To: <CAJ4nNe1FPo7Q=10dbK8sdzPRaRzYKJV6SKV3nyg5L2Li13b+og@mail.gmail.com>
> Priority: 0
> Thread-Topic: Inquiry
> Content-Type: multipart/alternative;
> boundary="----_=_NextPart_001_8149ed38.4fec8c61"
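
For anyone picking up the fixed-size properties work Nick describes, here is
a minimal sketch of how a single date property might be decoded. It is not
POI code. The class and method names are hypothetical, and the layout it
assumes (a 16-byte little-endian entry made of a 2-byte property type, a
2-byte property id, 4 bytes of flags, and an 8-byte value, with dates stored
as a Windows FILETIME counting 100-nanosecond ticks since 1601-01-01 UTC) is
my reading of the MSG format documentation linked above and should be checked
against it. Skipping the header that precedes the entries in the properties
stream is left out.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;
import java.util.Optional;

/** Hypothetical helper, not part of POI: decodes one fixed-size property entry. */
final class FixedPropertySketch {
    /** Property type code for a FILETIME value (PT_SYSTIME); assumed from the MSG docs. */
    static final int PT_SYSTIME = 0x0040;
    /** Seconds between 1601-01-01 (FILETIME epoch) and 1970-01-01 (Unix epoch). */
    static final long EPOCH_OFFSET_SECONDS = 11_644_473_600L;
    /** FILETIME counts 100-nanosecond ticks, so 10 million ticks per second. */
    static final long TICKS_PER_SECOND = 10_000_000L;

    private FixedPropertySketch() {}

    /** Converts a FILETIME tick count to a UTC Instant. */
    static Instant filetimeToInstant(long filetime) {
        long seconds = Math.floorDiv(filetime, TICKS_PER_SECOND) - EPOCH_OFFSET_SECONDS;
        long nanos = Math.floorMod(filetime, TICKS_PER_SECOND) * 100L;
        return Instant.ofEpochSecond(seconds, nanos);
    }

    /**
     * Reads one 16-byte entry (assumed layout: type, id, flags, value) and
     * returns its value as an Instant if the entry is a date, else empty.
     */
    static Optional<Instant> readDateEntry(byte[] entry) {
        if (entry.length != 16) {
            throw new IllegalArgumentException("expected a 16-byte entry, got " + entry.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(entry).order(ByteOrder.LITTLE_ENDIAN);
        int type = buf.getShort() & 0xFFFF; // low 16 bits of the property tag
        buf.getShort();                     // property id, unused in this sketch
        buf.getInt();                       // flags, unused in this sketch
        long value = buf.getLong();         // the 8-byte fixed-size value
        return type == PT_SYSTIME ? Optional.of(filetimeToInstant(value)) : Optional.empty();
    }
}
```

The result is in UTC, which would be consistent with the ruby-msg output in
[2] showing 12:11 -0000 for a message that Outlook displays as 8:11 AM local
time.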
