lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dejan Nenov" <dejanne...@jollyobject.com>
Subject RE: Best Practice: emails and file-attachments
Date Tue, 15 Aug 2006 18:24:14 GMT
The approach we I find best is to create both Email documents - where a list
(and links) to all attachments is contained as well as individual Attachment
documents.

It gets a little tricky when you have a forwarded email, containing an
original Email that contains a tar.gz attachment, which contains the
"actual" attached files :)

(Shameless promotion follows) If you are a Windows user, for a _very_ good
example get a copy of X1 Desktop (free - also distributed as Yahoo! Desktop
search) - then right-click on the column headers and look at the available
fields for email.


Dejan

-----Original Message-----
From: lude [mailto:lucene.developer@googlemail.com] 
Sent: Tuesday, August 15, 2006 10:29 AM
To: java-user@lucene.apache.org
Subject: Best Practice: emails and file-attachments

Hello,

does anybody has an idea what is the best design approch for realizing
the following:

The goal is to index emails and their corresponding file attachments.
One email could contain for example:

1 x subject
1 x sender-address
1 x to-addresses
1 x message-text
0..n x file-attachments  (each contains a 'file-name' and the
'file-content')

How should I build the index?

First approach:
Each email + attachments gets one document with the following fields:
subject, sender_address, to_address, message_text, 1_attachment_name,
1_attachment_content, 2_attachment_name, 2_attachment_content,
3_attachment_name, 3_attachment_content
Disadvantage:
Only three attachments could be indexed. It isn't a generic solution for
indexing 'n' file-attachments.

Second approach:
Each email gets one document with the main email-data and 0 to n documents
of file-attachments:
1 x  email_id, subject, sender_address, to_address, message_text
0..n x  email_id, attachment_name, attachment_content
Disadvantage:
At query time it is difficult to aggregate the documents that belongs to
each other. One hit per email (including attachments) should be shown.

Any thoughts?

Thanks
lude


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message