lucene-java-user mailing list archives

From John Haxby <...@scalix.com>
Subject Re: indexing emails
Date Mon, 19 Jun 2006 11:48:09 GMT
Michael J. Prichard wrote:
> I am working on indexing emails and want to have a "to" field.  I am 
> currently putting all the emails on one line separated w/ 
> spaces...example:
>
> michael@foo.bar john@boo.com jane@bar.com
>
> Then i index that with a StandardAnalyzer as follows:
>
> doc.add(new Field("to", (String) itemContent.get("to"), 
> Field.Store.YES, Field.Index.UN_TOKENIZED));
>
> Question is...is this the best way to do it?  I want to be able to 
> search for michael@foo.bar and pick out just those Documents, etc.
I took a slightly different approach.   Using javamail, given a To: line 
like this:

    To: Fred Smith <fs@example.com>,
        =?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@example.com>

I re-constructed the address list to look like this:

    Fred Smith fs@example.com Keld Jørn Simonsen keld@example.com
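
As a rough illustration of that reconstruction, here is a minimal sketch that uses only the Java standard library rather than JavaMail. The class and method names (AddressFlattener, decodeWords, flatten) are made up for this example, and it is not a full address parser: it simply decodes RFC 2047 encoded-words and blanks out angle brackets, quotes and commas. In real code JavaMail's InternetAddress.parse and MimeUtility.decodeText do this job properly.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JavaMail or Lucene.
class AddressFlattener {
    // An RFC 2047 encoded-word: =?charset?Q-or-B?payload?=
    private static final Pattern ENCODED_WORD =
        Pattern.compile("=\\?([^?]+)\\?([QqBb])\\?([^?]*)\\?=");

    // Replace every encoded-word in s with its decoded text.
    static String decodeWords(String s) {
        Matcher m = ENCODED_WORD.matcher(s);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(s, last, m.start());
            Charset cs = Charset.forName(m.group(1));
            byte[] bytes = m.group(2).equalsIgnoreCase("B")
                ? Base64.getDecoder().decode(m.group(3))
                : decodeQ(m.group(3));
            out.append(new String(bytes, cs));
            last = m.end();
        }
        out.append(s.substring(last));
        return out.toString();
    }

    // Q encoding: '_' means space, "=XX" is a hex-encoded byte.
    // Malformed hex digits will throw NumberFormatException.
    static byte[] decodeQ(String q) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (int i = 0; i < q.length(); i++) {
            char c = q.charAt(i);
            if (c == '_') {
                buf.write(' ');
            } else if (c == '=' && i + 2 < q.length()) {
                buf.write(Integer.parseInt(q.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                buf.write(c);
            }
        }
        return buf.toByteArray();
    }

    // Decode, then drop the address punctuation and collapse whitespace.
    static String flatten(String headerValue) {
        return decodeWords(headerValue)
            .replaceAll("[<>\",]", " ")
            .trim()
            .replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(flatten(
            "Fred Smith <fs@example.com>, "
            + "=?ISO-8859-1?Q?Keld_J=F8rn_Simonsen?= <keld@example.com>"));
    }
}
```

The sketch assumes the charset named in the encoded-word is one the JVM supports, and it ignores corner cases such as commas inside quoted display names.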

and fed that to the analyser.   I forget which analyser we eventually 
settled on, but an address such as "fred@example.com" turns into the 
tokens "fred", "example" and "com".   This actually gives rise to a 
remarkably natural way of searching for addresses.   People do things 
like searching for "lucene.apache.org" to look for mail sent to the 
lucene lists; they search for me variously as "jch", "john haxby" and 
"haxby"; they even, occasionally, search for complete mail addresses.   
They all work.
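
To show why those searches all work, here is a small stand-in written with the Java standard library only. It is not Lucene code and not the analyser we used (I don't remember which that was); it just assumes an analyser that lower-cases and splits on anything that isn't a letter or digit. The names AddressTokens, tokenize and matchesPhrase are invented for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for an analyser; not a Lucene class.
class AddressTokens {
    // Lower-case the text and split on runs of non-letter, non-digit characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // True if the query's tokens appear, in order and adjacent, in the field.
    static boolean matchesPhrase(String field, String query) {
        List<String> q = tokenize(query);
        return !q.isEmpty()
            && Collections.indexOfSubList(tokenize(field), q) >= 0;
    }

    public static void main(String[] args) {
        String to = "Fred Smith fs@example.com";
        System.out.println(tokenize("fred@example.com"));
        System.out.println(matchesPhrase(to, "example.com"));
        System.out.println(matchesPhrase(to, "fred smith"));
        System.out.println(matchesPhrase(to, "fs@example.com"));
    }
}
```

Because the query goes through the same splitting as the indexed text, a domain, a bare name and a complete address all reduce to token sequences that can be found in the field.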

The RFC2047 syntax in the example above gives one hint as to the 
minefield that address parsing can be.   If you look at the javamail 
spec, you'll also see reference to group-syntax -- it's often seen as

    undisclosed-recipients:;

but you'll also occasionally see

    example-group: fred@example.com, keld@example.com;

Javamail knows how to parse these, so I threw away the group name and 
just indexed the addresses.   It might've been better to keep the group 
name, but groups aren't that widely used so it probably doesn't make 
much difference.
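
For illustration only, here is roughly what "throw away the group name" amounts to, written against the Java standard library instead of JavaMail. The class name GroupSyntax and the method members are invented for this sketch, and the regular expression is a simplification that assumes unquoted group names and no nested groups; JavaMail's own parser is the thing to use in practice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper; a simplified stand-in for JavaMail's group handling.
class GroupSyntax {
    // "group-name: member, member;" with the member list captured as group 1.
    private static final Pattern GROUP =
        Pattern.compile("[^:;,<>@\"]+:([^;]*);");

    // Return the addresses in the header, with any group wrappers removed.
    static List<String> members(String headerValue) {
        String flat = GROUP.matcher(headerValue).replaceAll("$1");
        List<String> out = new ArrayList<>();
        for (String part : flat.split(",")) {
            String address = part.trim();
            if (!address.isEmpty()) {
                out.add(address);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(members("undisclosed-recipients:;"));
        System.out.println(
            members("example-group: fred@example.com, keld@example.com;"));
    }
}
```

An empty group such as undisclosed-recipients yields no addresses at all, so a message addressed only to it gets an empty "to" field under this scheme.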

Other headers cause headaches as well.   Things like the subject can be 
RFC2047 encoded so you'll need to decode them.  The various message-id 
headers are also slightly problematic.   If you're using "message-id" 
and "references" and "in-reply-to" you'll need to be careful -- the 
individual message-ids will need their angle brackets removed and they 
really ought not to be tokenized.
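
A minimal sketch of the message-id handling, using only the Java standard library (the class name MessageIds and the method extract are invented for the example). It pulls each id out of its angle brackets; each resulting string would then be added as its own untokenized field value, for instance with Field.Index.UN_TOKENIZED as in the quoted code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JavaMail or Lucene.
class MessageIds {
    // One message-id: the text between '<' and '>', with no whitespace inside.
    private static final Pattern ID = Pattern.compile("<([^<>\\s]+)>");

    // Return every message-id in the header value, without angle brackets.
    // Works for single-id headers (message-id) and lists (references).
    static List<String> extract(String headerValue) {
        List<String> ids = new ArrayList<>();
        Matcher m = ID.matcher(headerValue);
        while (m.find()) {
            ids.add(m.group(1));
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(extract("<a1@example.com> <b2@example.com>"));
    }
}
```

Keeping each id as a single term means a lookup by message-id is an exact match, which is what you want when following "references" or "in-reply-to" chains.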

It's also worth indexing *all* the message headers.   People do 
search on some odd things.   I also index the raw content-type 
-- those huge presentations can be found and deleted by searching for 
"content-type:application/vnd.ms-powerpoint".   Or at least I could.  It 
seems to be broken at the moment :-(
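
As a sketch of what indexing every header might start from, the following standard-library-only code unfolds a raw header block into lower-cased field names and their values, and reduces a Content-Type value to its bare media type. The names HeaderFields, parse and mediaType are invented here, and stripping the parameters is my assumption about what you'd want in an exact-match field; with JavaMail you would walk the message's headers instead of parsing text yourself.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of JavaMail or Lucene.
class HeaderFields {
    // Map each lower-cased header name to its values, in order of appearance.
    static Map<String, List<String>> parse(String rawHeaders) {
        // Unfold: a line break followed by space or tab continues the header.
        String unfolded = rawHeaders.replaceAll("\\r?\\n[ \\t]+", " ");
        Map<String, List<String>> fields = new LinkedHashMap<>();
        for (String line : unfolded.split("\\r?\\n")) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // not a header line
            }
            String name = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
        return fields;
    }

    // Drop any ";param=value" tail and lower-case what is left.
    static String mediaType(String contentType) {
        int semi = contentType.indexOf(';');
        String bare = semi < 0 ? contentType : contentType.substring(0, semi);
        return bare.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String raw = "Subject: slides\r\n"
            + "Content-Type: application/vnd.ms-powerpoint;\r\n"
            + " name=\"deck.ppt\"\r\n";
        Map<String, List<String>> fields = parse(raw);
        System.out.println(fields.keySet());
        System.out.println(mediaType(fields.get("content-type").get(0)));
    }
}
```

Each name/value pair can then become a field of the same name in the index, so that a query like the content-type one above has something to match.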

jch
