Hi Joe,
Are you looking to pay this person to help or are you looking for someone with the same "itch"
as you?
(Not that I am volunteering either way - it's not my area.)
Regards,
Dave
On Aug 21, 2012, at 2:33 PM, Joe Wicentowski wrote:
> Hi all,
>
> I hadn't heard from anyone about the question I posed last week --
> regarding POI/HSMF's problems identifying dates in Outlook .msg files.
> Is there a better forum for me to post this? Should I file a bug?
> Ideally, I'd like to find someone who can help complete the fix that
> Nick Burch began in POI's SVN trunk.
>
> Thanks for any pointers about the best way to proceed,
> Joe
>
> On Thu, Aug 16, 2012 at 6:52 PM, Joe Wicentowski <joewiz@gmail.com> wrote:
>> Hi all,
>>
>> I'm building an application
>> that relies on Tika to extract text from Outlook 2007 .msg files.
>> Tika relies on POI's HSMF libraries, which is why I'm writing to this
>> list about a problem: HSMF is not pulling out the date of many of my
>> Outlook messages.
>>
>> For example, when I look at one of my message files (.msg) in Outlook,
>> it says that the message was sent on "Fri 6/22/2012 8:11 AM", but when
>> I process the same message with Tika, no date appears in the output
>> [1].
>>
>> In comparison, I tried using a different tool, ruby-msg
>> (http://code.google.com/p/ruby-msg/), to process the same message, and
>> ruby-msg did pull out the date [2]. This experiment shows that the
>> date *is* in the .msg file, and that Tika is failing to pick it up.
>>
>> Nick Burch from the Tika mailing list took a close, hands-on look at
>> my .msg file, determined the cause, and outlined a path to the fix:
>>
>>> I think I've figured out what's wrong. It looks like Outlook stores
>>> properties with a fixed size of 0-8 bytes in a different chunk in the file,
>>> which we weren't processing.
>>>
>>> If you wanted to tackle it, that'd be great! You'll want to take a look at
>>> PropertiesChunk, and fill in the TODOs for readProperties and
>>> writeProperties, then add unit tests. See:
>>>
>>> http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hsmf/datatypes/PropertiesChunk.java?view=markup
>>>
>>> When that's all done and working, then the final step is to update
>>> MAPIMessage to read some of the values as needed out of the properties.
>>>
>>> The info I've been working with comes from this blog post:
>>> http://blogs.msdn.com/b/openspecification/archive/2009/11/06/msg-file-format-part-1.aspx
>>>
>>> (That links into suitable bits of the public documentation)
>>>
>>> I suspect it's under a day's work. I've put in place the basics, just needs
>>> someone to flesh it out.
>>
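
To make Nick's description concrete, here is a minimal stand-alone sketch of reading fixed-size date properties from a properties stream. It does not use POI's API and is not the code in PropertiesChunk. The layout is an assumption based on my reading of the public MSG format documentation: a 32-byte header for a top-level message, then 16-byte entries (4-byte tag with the type in the low 16 bits and the id in the high 16 bits, 4-byte flags, 8-byte value), with type 0x0040 meaning a FILETIME date and id 0x0039 being the client submit time. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

public class FixedPropertySketch {
    // Assumed layout constants (from the public MSG documentation, not from POI).
    static final int TOP_LEVEL_HEADER_SIZE = 32; // header size for a top-level message
    static final int ENTRY_SIZE = 16;            // tag (4) + flags (4) + value (8)
    static final int PT_SYSTIME = 0x0040;        // property type: FILETIME date
    static final int CLIENT_SUBMIT_TIME_ID = 0x0039;

    // Seconds between 1601-01-01 (FILETIME epoch) and 1970-01-01 (Unix epoch).
    static final long EPOCH_DIFF_SECONDS = 11_644_473_600L;

    /** Converts a FILETIME (100 ns ticks since 1601-01-01 UTC) to an Instant. */
    static Instant filetimeToInstant(long filetime) {
        long seconds = Math.floorDiv(filetime, 10_000_000L) - EPOCH_DIFF_SECONDS;
        long nanos = Math.floorMod(filetime, 10_000_000L) * 100L;
        return Instant.ofEpochSecond(seconds, nanos);
    }

    /** Walks the 16-byte entries after the header and collects every date property by id. */
    static Map<Integer, Instant> readDateProperties(byte[] stream, int headerSize) {
        Map<Integer, Instant> dates = new LinkedHashMap<>();
        ByteBuffer buf = ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN);
        for (int pos = headerSize; pos + ENTRY_SIZE <= stream.length; pos += ENTRY_SIZE) {
            int tag = buf.getInt(pos);
            int type = tag & 0xFFFF; // low 16 bits: property type
            int id = tag >>> 16;     // high 16 bits: property id
            if (type == PT_SYSTIME) {
                // Bytes 4..7 of the entry are flags; the value starts at offset 8.
                dates.put(id, filetimeToInstant(buf.getLong(pos + 8)));
            }
        }
        return dates;
    }

    public static void main(String[] args) {
        // Synthetic stream with one entry, using the date from the ruby-msg output [2].
        Instant sent = Instant.parse("2012-06-22T12:11:00Z");
        long filetime = (sent.getEpochSecond() + EPOCH_DIFF_SECONDS) * 10_000_000L;
        ByteBuffer buf = ByteBuffer.allocate(TOP_LEVEL_HEADER_SIZE + ENTRY_SIZE)
                .order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(TOP_LEVEL_HEADER_SIZE, (CLIENT_SUBMIT_TIME_ID << 16) | PT_SYSTIME);
        buf.putInt(TOP_LEVEL_HEADER_SIZE + 4, 0); // flags, unused here
        buf.putLong(TOP_LEVEL_HEADER_SIZE + 8, filetime);
        System.out.println(readDateProperties(buf.array(), TOP_LEVEL_HEADER_SIZE));
    }
}
```

A real fix would still need to locate the properties stream inside the .msg container, handle the other fixed-size types, and have MAPIMessage fall back to these values when the date is missing elsewhere.
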
>> While Nick kindly tracked down the cause, unfortunately I lack the
>> Java chops to complete the solution.
>>
>> Would anyone here be kind enough to assist me with this?
>>
>> I'm happy to test any attempted fixes, and I'm happy to provide more
>> info, like sample Outlook files (.msg files). My hope is that this
>> fix will allow POI to "just work" for more users who are evaluating
>> it.
>>
>> Thank you in advance,
>> Joe
>>
>>
>> [1] Tika output showing no date, retrieved via the following command:
>>
>> java -jar tika-app-1.1.jar "Inquiry.msg" > inquiry.html
>>
>> <html xmlns="http://www.w3.org/1999/xhtml">
>> <head>
>> <meta name="Message-Bcc" content="" />
>> <meta name="subject" content="Inquiry" />
>> <meta name="Content-Length" content="40960" />
>> <meta name="Message-Recipient-Address" content="snip@gmail.com" />
>> <meta name="Message-From" content="History Mailbox" />
>> <meta name="Author" content="History Mailbox" />
>> <meta name="Message-To" content="'Snip'" />
>> <meta name="Message-Cc" content="" />
>> <meta name="Content-Type" content="application/vnd.ms-outlook" />
>> <meta name="resourceName" content="RE Inquiry.msg" />
>> </head>
>> <body>
>> <h1>RE: Inquiry</h1>
>> <dl>
>> <dt>From</dt>
>> <dd>History Mailbox</dd>
>> <dt>To</dt>
>> <dd>'Snip'</dd>
>> <dt>Recipients</dt>
>> <dd>snip@gmail.com</dd>
>> </dl>
>> <p>Dear Snip</p>
>> ...
>>
>> [2] The ruby-msg output -- notice the "Date:" line:
>>
>> From: "History Mailbox" <removed-address@removed.com>
>> To: "Snip" <snip@gmail.com>
>> Subject: RE: Inquiry
>> Date: Fri, 22 Jun 2012 12:11:00 -0000
>> Message-ID: <000807F9A285794EAAD13EC6EAE33A760117AE0E237B@PASA1MB01.pace.unc>
>> In-Reply-To: <CAJ4nNe1FPo7Q=10dbK8sdzPRaRzYKJV6SKV3nyg5L2Li13b+og@mail.gmail.com>
>> Priority: 0
>> Thread-Topic: Inquiry
>> Content-Type: multipart/alternative;
>> boundary="----_=_NextPart_001_8149ed38.4fec8c61"
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
> For additional commands, e-mail: dev-help@poi.apache.org
>