lucene-java-user mailing list archives

From lude <lucene.develo...@googlemail.com>
Subject Best Practice: emails and file-attachments
Date Tue, 15 Aug 2006 17:28:41 GMT
Hello,

Does anybody have an idea what the best design approach is for realizing
the following?

The goal is to index emails and their corresponding file attachments.
One email could contain for example:

1 x subject
1 x sender-address
1 x to-addresses
1 x message-text
0..n x file-attachments  (each contains a 'file-name' and the
'file-content')

How should I build the index?

First approach:
Each email + attachments gets one document with the following fields:
subject, sender_address, to_address, message_text, 1_attachment_name,
1_attachment_content, 2_attachment_name, 2_attachment_content,
3_attachment_name, 3_attachment_content
Disadvantage:
Only three attachments can be indexed. It isn't a generic solution for
indexing 'n' file-attachments.
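
To make the limitation concrete, here is a minimal sketch of the first approach in plain Java. It deliberately does not use the Lucene API: the class FixedSlotEmailFields, the method toFields and the {name, content} pair representation are all hypothetical names chosen for illustration. Only the field names come from the list above.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Flattens one email into a fixed set of named fields (hypothetical helper, not a Lucene API). */
class FixedSlotEmailFields {

    /** The first approach reserves exactly three attachment slots. */
    static final int MAX_ATTACHMENTS = 3;

    /**
     * Builds the field map for one email. Each attachment is a {name, content} pair.
     * Attachments beyond MAX_ATTACHMENTS have no field to go into and are skipped.
     */
    static Map<String, String> toFields(String subject, String sender, String to,
                                        String text, List<String[]> attachments) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("subject", subject);
        fields.put("sender_address", sender);
        fields.put("to_address", to);
        fields.put("message_text", text);

        int slots = Math.min(attachments.size(), MAX_ATTACHMENTS);
        for (int i = 0; i < slots; i++) {
            String[] attachment = attachments.get(i);
            fields.put((i + 1) + "_attachment_name", attachment[0]);
            fields.put((i + 1) + "_attachment_content", attachment[1]);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Four attachments are supplied, but only three slots exist.
        List<String[]> attachments = Arrays.asList(
            new String[] {"a.txt", "alpha"},
            new String[] {"b.txt", "beta"},
            new String[] {"c.txt", "gamma"},
            new String[] {"d.txt", "delta"});
        Map<String, String> fields =
            toFields("Report", "me@example.org", "you@example.org", "See attached.", attachments);
        System.out.println(fields.keySet());
    }
}
```

In this sketch the fourth attachment never reaches the field map, so it could not be found by any search.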

Second approach:
Each email gets one document with the main email data, plus 0 to n
further documents, one per file attachment:
1 x  email_id, subject, sender_address, to_address, message_text
0..n x  email_id, attachment_name, attachment_content
Disadvantage:
At query time it is difficult to aggregate the documents that belong
together. Only one hit per email (including its attachments) should be shown.
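
One possible way to do this aggregation is to collapse the raw hits by email_id after the search and keep only the best-scoring hit per email. The following is a minimal sketch in plain Java, not Lucene code: the class EmailHitGrouper, the Hit type and groupByEmail are hypothetical names, and it assumes that every hit can report the email_id stored on its document together with a score.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Collapses per-document hits into one hit per email (hypothetical helper, not a Lucene API). */
class EmailHitGrouper {

    /** One search hit: the email_id stored on the matching document, and its score. */
    static final class Hit {
        final String emailId;
        final float score;

        Hit(String emailId, float score) {
            this.emailId = emailId;
            this.score = score;
        }
    }

    /** Keeps the best-scoring hit per email_id, then orders the result by descending score. */
    static List<Hit> groupByEmail(List<Hit> hits) {
        Map<String, Hit> best = new LinkedHashMap<>();
        for (Hit hit : hits) {
            Hit current = best.get(hit.emailId);
            if (current == null || hit.score > current.score) {
                best.put(hit.emailId, hit);
            }
        }
        List<Hit> grouped = new ArrayList<>(best.values());
        grouped.sort((a, b) -> Float.compare(b.score, a.score));
        return grouped;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("mail-1", 0.9f)); // an attachment document of mail-1
        hits.add(new Hit("mail-2", 0.7f)); // the main document of mail-2
        hits.add(new Hit("mail-1", 0.4f)); // the main document of mail-1
        for (Hit hit : groupByEmail(hits)) {
            System.out.println(hit.emailId + " " + hit.score);
        }
    }
}
```

The cost of this design is that the search has to fetch more raw hits than it finally displays, because several of them may collapse into the same email.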

Any thoughts?

Thanks
lude
