poi-dev mailing list archives

From: Dave Fisher <dave2w...@comcast.net>
Subject: Re: Problem extracting date from Outlook 2007 .msg file
Date: Wed, 22 Aug 2012 00:12:12 GMT

Hi Joe,

Are you looking to pay this person to help or are you looking for someone with the same "itch"
as you?

(Not that I am volunteering either way - it's not my area.)

Regards,
Dave

On Aug 21, 2012, at 2:33 PM, Joe Wicentowski wrote:

> Hi all,
> 
> I haven't heard from anyone about the question I posed last week --
> regarding POI/HSMF's problems identifying dates in Outlook .msg files.
> Is there a better forum for me to post this?  Should I file a bug?
> Ideally, I'd like to find someone who can help complete the fix that
> Nick Burch began in POI's SVN trunk.
> 
> Thanks for any pointers about the best way to proceed,
> Joe
> 
> On Thu, Aug 16, 2012 at 6:52 PM, Joe Wicentowski <joewiz@gmail.com> wrote:
>> Hi all,
>> 
>> I'm building an application
>> that relies on Tika to extract text from Outlook 2007 .msg files.
>> Tika relies on POI's HSMF libraries, which is why I'm writing to this
>> list about a problem: HSMF is not pulling out the date of many of my
>> Outlook messages.
>> 
>> For example, when I look at one of my message files (.msg) in Outlook,
>> it says that the message was sent on "Fri 6/22/2012 8:11 AM", but when
>> I process the same message with Tika, no date appears in the output
>> [1].
>> 
>> In comparison, I tried using a different tool, ruby-msg
>> (http://code.google.com/p/ruby-msg/), to process the same message, and
>> ruby-msg did pull out the date [2].  This experiment shows that the
>> date *is* in the .msg file, and that Tika is failing to pick it up.
>> 
>> Nick Burch from the Tika mailing list took a close, hands-on look at
>> my .msg file, determined the cause, and outlined a path to the fix:
>> 
>>> I think I've figured out what's wrong. It looks like Outlook stores
>>> properties with a fixed size of 0-8 bytes in a different chunk in the file,
>>> which we weren't processing.
>>> 
>>> If you wanted to tackle it, that'd be great! You'll want to take a look at
>>> PropertiesChunk, and fill in the TODOs for readProperties and
>>> writeProperties, then add unit tests. See:
>>> 
>>> http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hsmf/datatypes/PropertiesChunk.java?view=markup
>>> 
>>> When that's all done and working, the final step is to update
>>> MAPIMessage to read some of the values as needed out of the properties.
>>> 
>>> The info I've been working with comes from this blog post:
>>> http://blogs.msdn.com/b/openspecification/archive/2009/11/06/msg-file-format-part-1.aspx
>>> 
>>> (That links into suitable bits of the public documentation)
>>> 
>>> I suspect it's under a day's work. I've put in place the basics, just needs
>>> someone to flesh it out.
>> 
>> While Nick kindly tracked down the cause, unfortunately I lack the
>> Java chops to complete the solution.
>> 
>> Would anyone here be kind enough to assist me with this?
>> 
>> I'm happy to test any attempted fixes, and I'm happy to provide more
>> info, like sample Outlook files (.msg files).  My hope is that this
>> fix will allow POI to "just work" for more users who are evaluating
>> it.
>> 
>> Thank you in advance,
>> Joe
>> 
>> 
>> [1] Tika output showing no date, retrieved via the following command:
>> 
>>   java -jar tika-app-1.1.jar "Inquiry.msg" > inquiry.html
>> 
>> <html xmlns="http://www.w3.org/1999/xhtml">
>>    <head>
>>        <meta name="Message-Bcc" content="" />
>>        <meta name="subject" content="Inquiry" />
>>        <meta name="Content-Length" content="40960" />
>>        <meta name="Message-Recipient-Address" content="snip@gmail.com" />
>>        <meta name="Message-From" content="History Mailbox" />
>>        <meta name="Author" content="History Mailbox" />
>>        <meta name="Message-To" content="'Snip'" />
>>        <meta name="Message-Cc" content="" />
>>        <meta name="Content-Type" content="application/vnd.ms-outlook" />
>>        <meta name="resourceName" content="RE  Inquiry.msg" />
>>    </head>
>>    <body>
>>        <h1>RE: Inquiry</h1>
>>        <dl>
>>            <dt>From</dt>
>>            <dd>History Mailbox</dd>
>>            <dt>To</dt>
>>            <dd>'Snip'</dd>
>>            <dt>Recipients</dt>
>>            <dd>snip@gmail.com</dd>
>>        </dl>
>>        <p>Dear Snip</p>
>> ...
>> 
>> [2] The ruby-msg output -- notice the "Date:" line:
>> 
>> From: "History Mailbox" <removed-address@removed.com>
>> To: "Snip" <snip@gmail.com>
>> Subject: RE: Inquiry
>> Date: Fri, 22 Jun 2012 12:11:00 -0000
>> Message-ID: <000807F9A285794EAAD13EC6EAE33A760117AE0E237B@PASA1MB01.pace.unc>
>> In-Reply-To: <CAJ4nNe1FPo7Q=10dbK8sdzPRaRzYKJV6SKV3nyg5L2Li13b+og@mail.gmail.com>
>> Priority: 0
>> Thread-Topic: Inquiry
>> Content-Type: multipart/alternative;
>> boundary="----_=_NextPart_001_8149ed38.4fec8c61"
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
> For additional commands, e-mail: dev-help@poi.apache.org
> 

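For anyone who wants to pick this up, here is a rough sketch of what
reading the fixed-size properties Nick describes in the quoted message
might involve. It is not POI's PropertiesChunk code. It assumes the
layout described in the public MSG format documentation: a properties
stream made of a header followed by 16-byte entries, each holding a
2-byte type, a 2-byte property id, 4 bytes of flags and an 8-byte
little-endian value. The class name, the method names and the 32-byte
header size used in main are illustrative assumptions, as is the choice
of property id 0x0039 (client submit time) as the date of interest.
Date values are assumed to be Windows FILETIMEs (100 ns ticks since
1601-01-01 UTC).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;
import java.util.Optional;

// Hypothetical helper, not part of POI: scans a properties stream for a date.
class MsgPropertySketch {
    static final int PT_SYSTIME = 0x0040;            // property type: FILETIME value
    static final int PR_CLIENT_SUBMIT_TIME = 0x0039; // assumed id for the "sent" date
    static final int ENTRY_SIZE = 16;                // type(2) + id(2) + flags(4) + value(8)
    static final long SECONDS_1601_TO_1970 = 11_644_473_600L;

    // Convert a FILETIME (100 ns ticks since 1601-01-01 UTC) to an Instant.
    static Instant filetimeToInstant(long filetime) {
        long seconds = Math.floorDiv(filetime, 10_000_000L) - SECONDS_1601_TO_1970;
        long nanos = Math.floorMod(filetime, 10_000_000L) * 100L;
        return Instant.ofEpochSecond(seconds, nanos);
    }

    // Walk the fixed-size entries after the header and return the first
    // PT_SYSTIME value stored under the requested property id, if any.
    static Optional<Instant> findTime(byte[] stream, int headerSize, int propertyId) {
        ByteBuffer buf = ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN);
        for (int pos = headerSize; pos + ENTRY_SIZE <= stream.length; pos += ENTRY_SIZE) {
            int type = buf.getShort(pos) & 0xFFFF;
            int id = buf.getShort(pos + 2) & 0xFFFF;
            // Bytes pos+4 .. pos+7 hold flags, which this sketch ignores.
            if (type == PT_SYSTIME && id == propertyId) {
                return Optional.of(filetimeToInstant(buf.getLong(pos + 8)));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Synthetic stream: an assumed 32-byte header plus one date entry.
        byte[] stream = new byte[32 + ENTRY_SIZE];
        ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN)
                .putShort(32, (short) PT_SYSTIME)
                .putShort(34, (short) PR_CLIENT_SUBMIT_TIME)
                .putInt(36, 0x06)                          // arbitrary flags for the demo
                .putLong(40, 129_848_406_600_000_000L);    // made-up FILETIME for the demo
        System.out.println(findTime(stream, 32, PR_CLIENT_SUBMIT_TIME));
    }
}
```

The FILETIME in main is synthetic, chosen so that it converts to the
UTC time ruby-msg reported for Joe's sample (22 Jun 2012 12:11:00). A
real fix would read the stream out of the .msg file through POI's
existing chunk handling rather than from a hand-built byte array.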
