Hi Dave,
I would happily accept quotes for the job; please send quotes to me off list.
Thanks,
Joe
On Aug 21, 2012, at 8:12 PM, Dave Fisher <dave2wave@comcast.net> wrote:
> Hi Joe,
>
> Are you looking to pay this person to help or are you looking for someone with the same
> "itch" as you?
>
> (Not that I am volunteering either way - it's not my area.)
>
> Regards,
> Dave
>
> On Aug 21, 2012, at 2:33 PM, Joe Wicentowski wrote:
>
>> Hi all,
>>
>> I hadn't heard from anyone about the question I posed last week --
>> regarding POI/HSMF's problems identifying dates in Outlook .msg files.
>> Is there a better forum for me to post this? Should I file a bug?
>> Ideally, I'd like to find someone who can help complete the fix that
>> Nick Burch began in POI's SVN trunk.
>>
>> Thanks for any pointers about the best way to proceed,
>> Joe
>>
>> On Thu, Aug 16, 2012 at 6:52 PM, Joe Wicentowski <joewiz@gmail.com> wrote:
>>> Hi all,
>>>
>>> I'm building an application
>>> that relies on Tika to extract text from Outlook 2007 .msg files.
>>> Tika relies on POI's HSMF libraries, which is why I'm writing to this
>>> list about a problem: HSMF is not pulling out the date of many of my
>>> Outlook messages.
>>>
>>> For example, when I look at one of my message files (.msg) in Outlook,
>>> it says that the message was sent on "Fri 6/22/2012 8:11 AM", but when
>>> I process the same message with Tika, no date appears in the output
>>> [1].
>>>
>>> In comparison, I tried using a different tool, ruby-msg
>>> (http://code.google.com/p/ruby-msg/), to process the same message, and
>>> ruby-msg did pull out the date [2]. This experiment shows that the
>>> date *is* in the .msg file, and that Tika is failing to pick it up.
>>>
>>> Nick Burch from the Tika mailing list took a close, hands-on look at
>>> my .msg file, determined the cause, and outlined a path to the fix:
>>>
>>>> I think I've figured out what's wrong. It looks like Outlook stores
>>>> properties with a fixed size of 0-8 bytes in a different chunk in the file,
>>>> which we weren't processing.
>>>>
>>>> If you wanted to tackle it, that'd be great! You'll want to take a look at
>>>> PropertiesChunk, and fill in the TODOs for readProperties and
>>>> writeProperties, then add unit tests. See:
>>>>
>>>> http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hsmf/datatypes/PropertiesChunk.java?view=markup
>>>>
>>>> When that's all done and working, the final step is to update MAPIMessage
>>>> to read some of the values as needed out of the properties.
>>>>
>>>> The info I've been working with comes from this blog post:
>>>> http://blogs.msdn.com/b/openspecification/archive/2009/11/06/msg-file-format-part-1.aspx
>>>>
>>>> (That links into suitable bits of the public documentation)
>>>>
>>>> I suspect it's under a day's work. I've put in place the basics, just needs
>>>> someone to flesh it out.
>>>
>>> While Nick kindly tracked down the cause, unfortunately I lack the
>>> Java chops to complete the solution.
>>>
>>> Would anyone here be kind enough to assist me with this?
>>>
>>> I'm happy to test any attempted fixes, and I'm happy to provide more
>>> info, like sample Outlook files (.msg files). My hope is that this
>>> fix will allow POI to "just work" for more users who are evaluating
>>> it.
>>>
>>> Thank you in advance,
>>> Joe
>>>
>>>
>>> [1] Tika output showing no date, retrieved via the following command:
>>>
>>> java -jar tika-app-1.1.jar "Inquiry.msg" > inquiry.html
>>>
>>> <html xmlns="http://www.w3.org/1999/xhtml">
>>> <head>
>>> <meta name="Message-Bcc" content="" />
>>> <meta name="subject" content="Inquiry" />
>>> <meta name="Content-Length" content="40960" />
>>> <meta name="Message-Recipient-Address" content="snip@gmail.com" />
>>> <meta name="Message-From" content="History Mailbox" />
>>> <meta name="Author" content="History Mailbox" />
>>> <meta name="Message-To" content="'Snip'" />
>>> <meta name="Message-Cc" content="" />
>>> <meta name="Content-Type" content="application/vnd.ms-outlook" />
>>> <meta name="resourceName" content="RE Inquiry.msg" />
>>> </head>
>>> <body>
>>> <h1>RE: Inquiry</h1>
>>> <dl>
>>> <dt>From</dt>
>>> <dd>History Mailbox</dd>
>>> <dt>To</dt>
>>> <dd>'Snip'</dd>
>>> <dt>Recipients</dt>
>>> <dd>snip@gmail.com</dd>
>>> </dl>
>>> <p>Dear Snip</p>
>>> ...
>>>
>>> [2] The ruby-msg output -- notice the "Date:" line:
>>>
>>> From: "History Mailbox" <removed-address@removed.com>
>>> To: "Snip" <snip@gmail.com>
>>> Subject: RE: Inquiry
>>> Date: Fri, 22 Jun 2012 12:11:00 -0000
>>> Message-ID: <000807F9A285794EAAD13EC6EAE33A760117AE0E237B@PASA1MB01.pace.unc>
>>> In-Reply-To: <CAJ4nNe1FPo7Q=10dbK8sdzPRaRzYKJV6SKV3nyg5L2Li13b+og@mail.gmail.com>
>>> Priority: 0
>>> Thread-Topic: Inquiry
>>> Content-Type: multipart/alternative;
>>> boundary="----_=_NextPart_001_8149ed38.4fec8c61"
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
>> For additional commands, e-mail: dev-help@poi.apache.org
>>
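For anyone picking this up: the fixed-size properties Nick describes include
the message dates. Assuming the sent date is stored as a MAPI PT_SYSTIME value
(an 8-byte little-endian Windows FILETIME, counting 100-nanosecond ticks since
1601-01-01 UTC), decoding it might look roughly like the sketch below. The
class and method names are hypothetical and are not part of POI or HSMF; this
only illustrates the byte-to-date conversion, not the PropertiesChunk parsing.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;

// Hypothetical helper, not POI API: converts between an 8-byte FILETIME and an Instant.
public class FileTimeSketch {
    // Seconds between the FILETIME epoch (1601-01-01 UTC) and the Unix epoch (1970-01-01 UTC).
    static final long EPOCH_DIFF_SECONDS = 11_644_473_600L;
    // A FILETIME tick is 100 nanoseconds, so there are ten million ticks per second.
    static final long TICKS_PER_SECOND = 10_000_000L;

    // Decode 8 little-endian bytes (assumed to be a FILETIME) into an Instant.
    static Instant fromFileTime(byte[] eightBytes) {
        long ticks = ByteBuffer.wrap(eightBytes).order(ByteOrder.LITTLE_ENDIAN).getLong();
        long seconds = Math.floorDiv(ticks, TICKS_PER_SECOND) - EPOCH_DIFF_SECONDS;
        long nanos = Math.floorMod(ticks, TICKS_PER_SECOND) * 100;
        return Instant.ofEpochSecond(seconds, nanos);
    }

    // Encode an Instant as 8 little-endian FILETIME bytes (sub-tick precision is dropped).
    static byte[] toFileTime(Instant instant) {
        long ticks = (instant.getEpochSecond() + EPOCH_DIFF_SECONDS) * TICKS_PER_SECOND
                + instant.getNano() / 100;
        return ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putLong(ticks).array();
    }

    public static void main(String[] args) {
        // The date ruby-msg reported for the sample message, used here as test input.
        Instant sent = Instant.parse("2012-06-22T12:11:00Z");
        byte[] raw = toFileTime(sent);
        System.out.println(fromFileTime(raw));
    }
}
```

The round trip in main only checks the arithmetic; a real fix would still need
to locate the property in the properties stream and confirm its type first.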