lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Rob Staveley (Tom)" <>
Subject RE: Indexing existing email archives
Date Tue, 15 Aug 2006 12:38:11 GMT
 > Will it be possible to index them as it is?

You'll not be indexing headers or the From field specially if you simply
throw the text at the indexer. You'll not even be separating out separate
e-mail messages. You certainly need to be able to separate out e-mail
messages, to treat messages as separate documents. You'll also probably want
to MIME-decode to index attachments etc. Also you may want to unzip etc. If
you want to do this yourself, you should start by getting the Lucene In
Action book and look at the content handlers for different file types etc
etc, but reading the book will probably take a chunk out of your "couple of
days". That's why taking Tropo off the shelf is probably your best route,
assuming that Tropo's analysis is adequate for your needs, handling all of
your attachment requirements. 

Disclaimer: I'm not familiar with Tropo.

-----Original Message-----
From: Rob Staveley (Tom) [] 
Sent: 15 August 2006 13:26
Subject: RE: Indexing existing email archives

OK, we're well off topic now. If you have follow up on this can I recommend
that you don't do this via 

You'll see from that
Thunderbird stores its mail in mbox format. If you use Dovecot for your imap
server, you'll be able to use the mbox file directly. Just copy the file
over. mbox is very easy to work with. You'll have seen from looking at it
with your text editor that  there are "From lines" followed by raw RFC 822
data. You can append one mbox file to another very simply.

-----Original Message-----
From: Suba Suresh []
Sent: 15 August 2006 12:46
Subject: Re: Indexing existing email archives

The mail archives and saved mail folders can be opened up and read as a text
file using textpad or wordpad. Will it be possible to index them as it is?
The mail client is Thunderbird Mailbox.

If I have to use third party software is there anything you can suggest?

suba suresh.

Rob Staveley (Tom) wrote:
>> I have to have this working in next couple of days.
> I had a similar requirement and it look me several weeks to get 
> something working. I think you'll need to make use of as much off the 
> shelf software as you can, given your timescale.
> If you don't want to get your hands dirty with Lucene and like Tropo, 
> why not convert your local storage to mbox/maildir/mbx if it is not 
> already in the appropriate format and light up imapd while you index?
> e.g. If your local format is PST (i.e. Outlook), take a look at 
> -----Original Message-----
> From: Suba Suresh []
> Sent: 14 August 2006 16:50
> To:
> Subject: Indexing existing email archives
> I was looking at "" 
> and my understanding is it is used to retrieve and index the emails 
> that is on the email server. I have some stored emails in folders in 
> my local disk and huge list of email archives in another system. Is 
> there
a way I could index them?
> I have to have this working in next couple of days. Any help and 
> suggestions are appreciated.
> thanks,
> suba suresh.
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message