lucene-java-user mailing list archives

From Chris Hostetter <>
Subject RE: indexing emails
Date Mon, 19 Jun 2006 08:21:47 GMT

: As far as indexing goes index each address in a separate un-tokenized
: field not space delimited in a single field. It is also useful to put
: the To; CC and BCC in a single field to enable you to search to email

Indexing email isn't something i've had to think about a lot in my life ..
but if i were going to do it i would certainly have both header specific
fields as well as a "recipients" field containing To/Cc/Bcc and a
"participants" field that also contained the From/Sender/X-Sender.

I would add each address as a separate Field instance, using a custom
"EmailAnalyzer" with a really high position increment gap.  The Analyzer
should index both the full input with no tokenization, as well as the
input split on the @ symbol, and the input tokenized on any character in
the set "_-.+" to the left of the @ and on "." to the right of the @ ...
BUT: not the last "."
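
A rough sketch of those splitting rules in plain Java, with no Lucene
classes involved. The class name EmailAddressTokens, the method name
tokenize, and the choice to drop duplicate tokens are all assumptions made
for illustration; a real "EmailAnalyzer" would have to wrap this logic in
a TokenStream and decide on token positions, which this sketch does not
attempt. "Not the last '.'" is read here as "keep the final two domain
labels together", which is one possible interpretation.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
final class EmailAddressTokens {

    /** Returns the distinct tokens for one address, in the order produced. */
    static List<String> tokenize(String address) {
        Set<String> tokens = new LinkedHashSet<>();
        tokens.add(address);                    // 1. the full input, untokenized

        int at = address.lastIndexOf('@');
        if (at <= 0 || at == address.length() - 1) {
            return new ArrayList<>(tokens);     // no usable '@': full input only
        }
        String local = address.substring(0, at);
        String domain = address.substring(at + 1);
        tokens.add(local);                      // 2. the input split on '@'
        tokens.add(domain);

        // 3. left of the '@': split on any character in the set "_-.+"
        for (String piece : local.split("[_\\-.+]")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }

        // 4. right of the '@': split on '.', but never on the last '.'
        List<String> labels = new ArrayList<>();
        for (String label : domain.split("\\.")) {
            if (!label.isEmpty()) {
                labels.add(label);
            }
        }
        int n = labels.size();
        for (int i = 0; i < n - 2; i++) {
            tokens.add(labels.get(i));          // leading labels on their own
        }
        if (n >= 2) {
            // the final two labels stay joined, e.g. "example.com"
            tokens.add(labels.get(n - 2) + "." + labels.get(n - 1));
        }
        return new ArrayList<>(tokens);
    }
}
```

With a made-up address such as "jane.doe+lists@mail.example.com" (not the
address from the original message), tokenize returns the full address,
"jane.doe+lists", "mail.example.com", "jane", "doe", "lists", "mail" and
"example.com".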

So for the input "" the following tokenstream
would be created...

: you have sent to a person. I would also recommend you do some processing
: on the Subject field to remove FW and RE this will allow you to search
: by subject and pick up all emails in the thread.
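
For what it's worth, stripping those prefixes could look something like
the sketch below. The class name SubjectNormalizer and the exact prefix
list (re, fw, fwd, case-insensitive, each followed by a colon) are
assumptions, not anything specified in this thread, and this only
normalizes the subject text; it does not detect threads.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class SubjectNormalizer {

    // One or more leading "re:", "fw:" or "fwd:" prefixes, in any case,
    // with optional whitespace around each prefix and its colon.
    private static final Pattern REPLY_PREFIXES = Pattern.compile(
            "^(?:\\s*(?:re|fw|fwd)\\s*:\\s*)+", Pattern.CASE_INSENSITIVE);

    /** Returns the subject with leading reply/forward prefixes removed. */
    static String normalize(String subject) {
        return REPLY_PREFIXES.matcher(subject).replaceFirst("").trim();
    }
}
```

The normalized value would presumably go into its own field next to the
raw Subject, so the original header stays searchable as well.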

There are essays and essays and more essays on detecting/inferring threads
in email ... as I recall, JWZ has really written the definitive guide for

