poi-dev mailing list archives

From Joe Wicentowski <joe...@gmail.com>
Subject Re: Problem extracting date from Outlook 2007 .msg file
Date Wed, 22 Aug 2012 00:50:29 GMT
Hi Dave,

I would happily accept quotes for the job; please send quotes to me off list.

Thanks,
Joe

Sent from my iPad

On Aug 21, 2012, at 8:12 PM, Dave Fisher <dave2wave@comcast.net> wrote:

> Hi Joe,
> 
> Are you looking to pay this person to help or are you looking for someone with the same
> "itch" as you?
> 
> (Not that I am volunteering either way - it's not my area.)
> 
> Regards,
> Dave
> 
> On Aug 21, 2012, at 2:33 PM, Joe Wicentowski wrote:
> 
>> Hi all,
>> 
>> I hadn't heard from anyone about the question I posed last week --
>> regarding POI/HSMF's problems identifying dates in Outlook .msg files.
>> Is there a better forum for me to post this?  Should I file a bug?
>> Ideally, I'd like to find someone who can help complete the fix that
>> Nick Burch began in POI's SVN trunk.
>> 
>> Thanks for any pointers about the best way to proceed,
>> Joe
>> 
>> On Thu, Aug 16, 2012 at 6:52 PM, Joe Wicentowski <joewiz@gmail.com> wrote:
>>> Hi all,
>>> 
>>> Hello!  This is my message to the list.  I'm building an application
>>> that relies on Tika to extract text from Outlook 2007 .msg files.
>>> Tika relies on POI's HSMF libraries, which is why I'm writing to this
>>> list about a problem: HSMF is not pulling out the date of many of my
>>> Outlook messages.
>>> 
>>> For example, when I look at one of my message files (.msg) in Outlook,
>>> it says that the message was sent on "Fri 6/22/2012 8:11 AM", but when
>>> I process the same message with Tika, no date appears in the output
>>> [1].
>>> 
>>> In comparison, I tried using a different tool, ruby-msg
>>> (http://code.google.com/p/ruby-msg/), to process the same message, and
>>> ruby-msg did pull out the date [2].  This experiment shows that the
>>> date *is* in the .msg file, and that Tika is failing to pick it up.
>>> 
>>> Nick Burch from the Tika mailing list took a close, hands-on look at
>>> my .msg file, determined the cause, and outlined a path to the fix:
>>> 
>>>> I think I've figured out what's wrong. It looks like outlook stores
>>>> properties with a fixed size of 0-8 bytes in a different chunk in the file,
>>>> which we weren't processing.
>>>> 
>>>> If you wanted to tackle it, that'd be great! You'll want to take a look at
>>>> PropertiesChunk, and fill in the TODOs for readProperties and
>>>> writeProperties, then add unit tests. See:
>>>> 
>>>> http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hsmf/datatypes/PropertiesChunk.java?view=markup
>>>> 
>>>> When that's all done and working, then the final step is to update
>>>> MAPIMessage to read some of the values as needed out of the properties.
>>>> 
>>>> The info I've been working with comes from this blog post:
>>>> http://blogs.msdn.com/b/openspecification/archive/2009/11/06/msg-file-format-part-1.aspx
>>>> 
>>>> (That links into suitable bits of the public documentation)
>>>> 
>>>> I suspect it's under a day's work. I've put in place the basics, just needs
>>>> someone to flesh it out.
>>> 
>>> While Nick kindly tracked down the cause, unfortunately I lack the
>>> Java chops to complete the solution.
>>> 
>>> Would anyone here be kind enough to assist me with this?
>>> 
>>> I'm happy to test any attempted fixes, and I'm happy to provide more
>>> info, like sample Outlook files (.msg files).  My hope is that this
>>> fix will allow POI to "just work" for more users who are evaluating
>>> it.
>>> 
>>> Thank you in advance,
>>> Joe
>>> 
>>> 
>>> [1] Tika output showing no date, retrieved via the following command:
>>> 
>>>  java -jar tika-app-1.1.jar "Inquiry.msg" > inquiry.html
>>> 
>>> <html xmlns="http://www.w3.org/1999/xhtml">
>>>   <head>
>>>       <meta name="Message-Bcc" content="" />
>>>       <meta name="subject" content="Inquiry" />
>>>       <meta name="Content-Length" content="40960" />
>>>       <meta name="Message-Recipient-Address" content="snip@gmail.com" />
>>>       <meta name="Message-From" content="History Mailbox" />
>>>       <meta name="Author" content="History Mailbox" />
>>>       <meta name="Message-To" content="'Snip'" />
>>>       <meta name="Message-Cc" content="" />
>>>       <meta name="Content-Type" content="application/vnd.ms-outlook" />
>>>       <meta name="resourceName" content="RE  Inquiry.msg" />
>>>   </head>
>>>   <body>
>>>       <h1>RE: Inquiry</h1>
>>>       <dl>
>>>           <dt>From</dt>
>>>           <dd>History Mailbox</dd>
>>>           <dt>To</dt>
>>>           <dd>'Snip'</dd>
>>>           <dt>Recipients</dt>
>>>           <dd>snip@gmail.com</dd>
>>>       </dl>
>>>       <p>Dear Snip</p>
>>> ...
>>> 
>>> [2] The ruby-msg output -- notice the "Date:" line:
>>> 
>>> From: "History Mailbox" <removed-address@removed.com>
>>> To: "Snip" <snip@gmail.com>
>>> Subject: RE: Inquiry
>>> Date: Fri, 22 Jun 2012 12:11:00 -0000
>>> Message-ID: <000807F9A285794EAAD13EC6EAE33A760117AE0E237B@PASA1MB01.pace.unc>
>>> In-Reply-To: <CAJ4nNe1FPo7Q=10dbK8sdzPRaRzYKJV6SKV3nyg5L2Li13b+og@mail.gmail.com>
>>> Priority: 0
>>> Thread-Topic: Inquiry
>>> Content-Type: multipart/alternative;
>>> boundary="----_=_NextPart_001_8149ed38.4fec8c61"
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
>> For additional commands, e-mail: dev-help@poi.apache.org
>> 
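For anyone picking this up, Nick's diagnosis quoted above (fixed-size properties of up to 8 bytes living in a separate properties chunk) can be illustrated with a rough sketch. This is not POI's API and not the contents of PropertiesChunk; the class and method names are hypothetical. It also assumes, based on my reading of the public .msg format documentation, that the top-level properties stream has a 32-byte header followed by 16-byte entries (4-byte property tag, 4-byte flags, 8-byte value), that the date type is PT_SYSTIME (0x0040) holding a Windows FILETIME, and that the sent date uses property id 0x0039. All of that should be checked against the specification before relying on it.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not part of POI: scans a properties stream for date values.
class FixedPropsSketch {
    // Assumed layout of the top-level message's properties stream.
    static final int HEADER_SIZE = 32;
    static final int ENTRY_SIZE = 16;
    // Assumed property type for dates, and assumed id of the "client submit time".
    static final int PT_SYSTIME = 0x0040;
    static final int CLIENT_SUBMIT_TIME_ID = 0x0039;
    // Seconds between 1601-01-01 (FILETIME epoch) and 1970-01-01 (Unix epoch).
    static final long EPOCH_DIFF_SECONDS = 11_644_473_600L;

    // A FILETIME counts 100-nanosecond ticks since 1601-01-01T00:00:00Z.
    static Instant filetimeToInstant(long filetime) {
        long seconds = Math.floorDiv(filetime, 10_000_000L) - EPOCH_DIFF_SECONDS;
        long nanos = Math.floorMod(filetime, 10_000_000L) * 100L;
        return Instant.ofEpochSecond(seconds, nanos);
    }

    // Returns property id -> date for every PT_SYSTIME entry in the stream.
    static Map<Integer, Instant> readDates(byte[] stream) {
        Map<Integer, Instant> dates = new LinkedHashMap<>();
        ByteBuffer buf = ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN);
        for (int pos = HEADER_SIZE; pos + ENTRY_SIZE <= stream.length; pos += ENTRY_SIZE) {
            int tag = buf.getInt(pos);   // low 16 bits: type, high 16 bits: id
            int type = tag & 0xFFFF;
            int id = tag >>> 16;
            if (type == PT_SYSTIME) {
                // Skip the 4-byte flags field; the value occupies the last 8 bytes.
                dates.put(id, filetimeToInstant(buf.getLong(pos + 8)));
            }
        }
        return dates;
    }

    public static void main(String[] args) {
        // Build a synthetic stream holding one date: the sent time ruby-msg reported.
        Instant sent = Instant.parse("2012-06-22T12:11:00Z");
        long filetime = (sent.getEpochSecond() + EPOCH_DIFF_SECONDS) * 10_000_000L;
        ByteBuffer buf = ByteBuffer.allocate(HEADER_SIZE + ENTRY_SIZE)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(HEADER_SIZE, (CLIENT_SUBMIT_TIME_ID << 16) | PT_SYSTIME);
        buf.putLong(HEADER_SIZE + 8, filetime);
        System.out.println(readDates(buf.array()).get(CLIENT_SUBMIT_TIME_ID));
    }
}
```

The sketch runs against a synthetic byte array rather than a real .msg file, so it only demonstrates the decoding step; locating the properties stream inside the file and wiring the result into MAPIMessage are the parts Nick describes as still to be done.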