lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Christiaan Fluit <>
Subject Re: Exchange/PST/Mail parsing
Date Mon, 02 Jul 2007 09:57:44 GMT
Hello Grant (cc-ing aperture-devel),

I am one of the Aperture admins, I can tell you a bit more about 
Aperture's mail facilities.

Short intro: Aperture is a framework for crawling and full-text and 
metadata extraction of a growing number of sources and file formats. We 
try to select the best-of-breed of the large number of open source 
libraries that tackle a specific source or format (e.g. PDFBox, Poi, 
JavaMail) and write some glue code around it so that they can be invoked 
in a uniform way. It's currently used in a number of desktop and 
enterprise search applications, both research and production systems.

At the moment we support a number of mail systems.

We can crawl IMAP mail boxes through JavaMail. In general it seems to 
work well, problems are usually caused by IMAP servers not conforming to 
the IMAP specs.

Some people have used the ImapCrawler to crawl MS Exchange as well. Some 
succeeded, some didn't. I don't really know whether the fault is in 
Aperture's code or in the Exchange configuration but I would be happy to 
take a look at it when someone runs into problems.

Outlook can also be crawled by connecting to a running Outlook process 
through jacob.dll. Others on aperture-devel can tell you more about its 
current status. Besides this crawler, I would also be very interested in 
having a crawler that directly processes .pst files, as to stay clear 
from communicating with other processes outside your own control.

People have been working on crawling Thunderbird mailboxes but I don't 
know what the current status is.

Ultimately, we try to support any major mail system. In practice, effort 
is usually dependent on knowledge and experience as well as customer demand.

We are happy to help you out with trying to get Aperture working in your 
domain and looking into the problems that you may encounter.

Kind regards,


Grant Ingersoll wrote:
> Anyone have any recommendations on a decent, open (doesn't have to be 
> Apache license, but would prefer non-GPL if possible), extractor for MS 
> Exchange and/or PST files?  The Zoe link on the FAQ [1] seems dead.
> For mbox, I think mstor will suffice for me and I think tropo (from the 
> FAQ should work for IMAP).  Does anyone have experience with 
> [1] 

> Cheers,
> Grant

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message