lucene-java-user mailing list archives

From Michael Bell
Subject Seeking Advice
Date Wed, 15 Aug 2007 18:01:36 GMT
We are writing a mail archiving program. Each piece of the message (e.g., each attachment) is
stored separately.

I'll try to keep this short and sweet :)

Currently we index the main header fields, like

recipients (space delimited)


This stuff is really only needed once per e-mail.

We also index the attachment info:

attachment size (changed to a range like "large", "medium", etc)
attachment name
full text index

This stuff needs to be kept distinct for each attachment in the e-mail.

Our current algorithm is wasteful, but I see no better way to do it.

In a loop, for each attachment (and once if we have none), we add all the main header stuff
and the attachment stuff, as a separate Document per attachment. This is wasteful, because
the main header stuff is needlessly repeated.
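
To make the current scheme concrete, here is a minimal sketch in plain Java. It does not use the Lucene API: a Map stands in for a Document, and the field names (recipients, attachmentName, attachmentSize, contents) and the sample addresses are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the "one Document per attachment" scheme.
// A Map stands in for a Lucene Document; all field names are hypothetical.
class PerAttachmentDocs {

    // Builds one flat document per attachment, copying the header fields into each.
    // An e-mail with no attachments still yields one document holding only the headers.
    static List<Map<String, String>> build(Map<String, String> headers,
                                           List<Map<String, String>> attachments) {
        List<Map<String, String>> docs = new ArrayList<>();
        if (attachments.isEmpty()) {
            docs.add(new LinkedHashMap<>(headers));
            return docs;
        }
        for (Map<String, String> attachment : attachments) {
            // The header fields are repeated here, once per attachment.
            Map<String, String> doc = new LinkedHashMap<>(headers);
            doc.putAll(attachment);
            docs.add(doc);
        }
        return docs;
    }

    // Convenience helper that bundles the per-attachment fields.
    static Map<String, String> attachment(String name, String sizeRange, String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("attachmentName", name);
        fields.put("attachmentSize", sizeRange);
        fields.put("contents", text);
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("recipients", "alice@example.com bob@example.com");

        List<Map<String, String>> attachments = new ArrayList<>();
        attachments.add(attachment("A.doc", "small", "turnip"));
        attachments.add(attachment("B.doc", "large", "dog"));

        // Two documents come out, each carrying its own copy of the recipients field.
        for (Map<String, String> doc : build(headers, attachments)) {
            System.out.println(doc);
        }
    }
}
```

The duplication is visible in build(): every header field is copied once per attachment.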

Now, it would seem better and more efficient to have one Document for the whole e-mail, storing
the main header stuff only once, and storing the Attachment stuff as multiple instances of
the same field. Lucene supports this.

The problem is that a search on the attachment fields will then return Cartesian-product results: criteria can match across different attachments.


Say I have two attachments, one named A.doc and one named B.doc. A.doc contains the full text "turnip"
and B.doc contains the text "dog".

Now if the user enters a search requesting e-mail that contains attachment name A.doc and
contents "dog", the results will be:

For the Per-Document storage:

no results found (correct I'd argue)

For the Single Document storage:

1 result found (because the full text and names of both are stored in the same Document, albeit
in different Field instances)
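
The difference between the two outcomes can be reproduced with a small plain-Java sketch. This is only a model of the matching logic, not Lucene code: the Attachment class and both hit-counting methods are hypothetical, and the case-insensitive name comparison is an assumption about how the name field would be analyzed.

```java
import java.util.ArrayList;
import java.util.List;

// Models why pooling attachment fields into one document produces cross matches.
class CrossMatchDemo {

    // One attachment: its file name plus its extracted full text.
    static final class Attachment {
        final String name;
        final String text;

        Attachment(String name, String text) {
            this.name = name;
            this.text = text;
        }
    }

    // Per-attachment model: both criteria must hold for the same attachment.
    static int hitsPerAttachment(List<Attachment> email, String name, String term) {
        int hits = 0;
        for (Attachment a : email) {
            if (a.name.equalsIgnoreCase(name) && a.text.contains(term)) {
                hits++;
            }
        }
        return hits;
    }

    // Single-document model: each field pools the values of all attachments,
    // so the two criteria are satisfied independently of each other.
    static int hitsSingleDocument(List<Attachment> email, String name, String term) {
        boolean nameMatches = false;
        boolean textMatches = false;
        for (Attachment a : email) {
            nameMatches |= a.name.equalsIgnoreCase(name);
            textMatches |= a.text.contains(term);
        }
        return (nameMatches && textMatches) ? 1 : 0;
    }

    // The example e-mail from this message: A.doc holds "turnip", B.doc holds "dog".
    static List<Attachment> sampleEmail() {
        List<Attachment> email = new ArrayList<>();
        email.add(new Attachment("A.doc", "turnip"));
        email.add(new Attachment("B.doc", "dog"));
        return email;
    }

    public static void main(String[] args) {
        List<Attachment> email = sampleEmail();
        System.out.println(hitsPerAttachment(email, "A.doc", "dog"));  // prints 0
        System.out.println(hitsSingleDocument(email, "A.doc", "dog")); // prints 1
    }
}
```

In the single-document model the name criterion is satisfied by A.doc and the contents criterion by B.doc, which is exactly the cross match described above.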

While I'm tempted by the siren call of the Single Document method, it seems like this would return
unexpected results from the user's point of view (although one could argue otherwise, since
holistically searching the e-mail as a whole, it's returning the "right" results).

What do you folks think? Any ideas for a better way to approach this?


