lucene-java-user mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject RE: indexing emails
Date Mon, 19 Jun 2006 08:21:47 GMT

: As far as indexing goes index each address in a separate un-tokenized
: field not space delimited in a single field. It is also useful to put
: the To; CC and BCC in a single field to enable you to search to email

Indexing email isn't something I've had to think about a lot in my life ..
but if I were going to do it I would certainly have both header-specific
fields as well as a "recipients" field containing To/Cc/Bcc and a
"participants" field that also contained the From/Sender/X-Sender.
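
As a rough illustration of that field layout, here is a minimal sketch in plain Java. It does not use the Lucene API. The class name EmailFields, the lower-cased field names, and the assumption that headers arrive as a map keyed by canonical header name ("To", "Cc", ...) are all hypothetical choices for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: maps parsed headers to the field layout described above.
final class EmailFields {
    private static final List<String> RECIPIENT_HEADERS = List.of("To", "Cc", "Bcc");
    private static final List<String> SENDER_HEADERS = List.of("From", "Sender", "X-Sender");

    // headers: canonical header name -> addresses found in that header (assumed input shape).
    static Map<String, List<String>> build(Map<String, List<String>> headers) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        List<String> recipients = new ArrayList<>();
        List<String> participants = new ArrayList<>();

        // Header-specific fields plus the combined "recipients" field.
        for (String header : RECIPIENT_HEADERS) {
            List<String> values = headers.getOrDefault(header, List.of());
            if (!values.isEmpty()) {
                fields.put(header.toLowerCase(Locale.ROOT), new ArrayList<>(values));
                recipients.addAll(values);
            }
        }
        // "participants" holds the recipients plus From/Sender/X-Sender.
        participants.addAll(recipients);
        for (String header : SENDER_HEADERS) {
            List<String> values = headers.getOrDefault(header, List.of());
            if (!values.isEmpty()) {
                fields.put(header.toLowerCase(Locale.ROOT), new ArrayList<>(values));
                participants.addAll(values);
            }
        }
        fields.put("recipients", recipients);
        fields.put("participants", participants);
        return fields;
    }
}
```

Each address in each list would then become its own Field instance under that field name when the document is built.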

I would add each address as a separate Field instance, using a custom
"EmailAnalyzer" with a really high position increment gap (so that phrase
queries can't match across two different addresses).  The Analyzer
should index both the full input with no tokenization, as well as the
input split on the @ symbol, and the input tokenized on any character in
the set "_-.+" to the left of the @ and on "." to the right of the @ ...
BUT: not the last "."

So for the input "java-user@lucene.apache.org" the following tokenstream
would be created...

	java-user@lucene.apache.org
	java
	user
	java-user
	lucene.apache.org
	lucene
	apache.org
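
The splitting rules above can be sketched in plain Java. This is only the string-splitting logic, not a Lucene Analyzer or TokenStream. The class name EmailTokens, the de-duplication of repeated tokens, and the decision to return only the full input when there is no usable @ are assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: produces the tokens described above for one address.
final class EmailTokens {
    static List<String> tokenize(String address) {
        // LinkedHashSet keeps insertion order and drops duplicates (an assumption of this sketch).
        Set<String> tokens = new LinkedHashSet<>();
        tokens.add(address); // the full input, untokenized

        int at = address.lastIndexOf('@');
        if (at <= 0 || at == address.length() - 1) {
            return new ArrayList<>(tokens); // no usable @, keep only the full input
        }
        String local = address.substring(0, at);
        String domain = address.substring(at + 1);

        // Left of the @: split on any of the characters _ - . +
        for (String piece : local.split("[_\\-.+]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        // The two halves produced by splitting on the @ symbol.
        tokens.add(local);
        tokens.add(domain);

        // Right of the @: split on "." but never on the last ".".
        int lastDot = domain.lastIndexOf('.');
        if (lastDot > 0) {
            String[] labels = domain.substring(0, lastDot).split("\\.");
            for (int i = 0; i < labels.length - 1; i++) {
                if (!labels[i].isEmpty()) {
                    tokens.add(labels[i]);
                }
            }
            if (labels.length > 0) {
                // The final label stays attached to the top-level domain, e.g. "apache.org".
                tokens.add(labels[labels.length - 1] + domain.substring(lastDot));
            }
        }
        return new ArrayList<>(tokens);
    }

    public static void main(String[] args) {
        for (String token : tokenize("java-user@lucene.apache.org")) {
            System.out.println(token);
        }
    }
}
```

A real analyzer would also have to decide what position each of these tokens gets, which this sketch leaves out.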


: you have sent to a person. I would also recommend you do some processing
: on the Subject field to remove FW and RE this will allow you to search
: by subject and pick up all emails in the thread.

There are essays and essays and more essays on detecting/inferring threads
in email ... as I recall, JWZ has really written the definitive guide for
this.




-Hoss



