lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From lude <lucene.develo...@googlemail.com>
Subject Re: Best Practice: emails and file-attachments
Date Wed, 16 Aug 2006 08:50:25 GMT
Hi Dejan,

how do you query for email- and(!) attachment-documents,
if you just want to present one hit per email (even if the searchterm
matches
in the email- and(!) in the corresponding attachment-document)?

Thanks
lude


On 8/15/06, Dejan Nenov <dejannenov@jollyobject.com> wrote:
>
> The approach we I find best is to create both Email documents - where a
> list
> (and links) to all attachments is contained as well as individual
> Attachment
> documents.
>
> It gets a little tricky when you have a forwarded email, containing an
> original Email that contains a tar.gz attachment, which contains the
> "actual" attached files :)
>
> (Shameless promotion follows) If you are a Windows user, for a _very_ good
> example get a copy of X1 Desktop (free - also distributed as Yahoo!
> Desktop
> search) - then right-click on the column headers and look at the available
> fields for email.
>
>
> Dejan
>
> -----Original Message-----
> From: lude [mailto:lucene.developer@googlemail.com]
> Sent: Tuesday, August 15, 2006 10:29 AM
> To: java-user@lucene.apache.org
> Subject: Best Practice: emails and file-attachments
>
> Hello,
>
> does anybody has an idea what is the best design approch for realizing
> the following:
>
> The goal is to index emails and their corresponding file attachments.
> One email could contain for example:
>
> 1 x subject
> 1 x sender-address
> 1 x to-addresses
> 1 x message-text
> 0..n x file-attachments  (each contains a 'file-name' and the
> 'file-content')
>
> How should I build the index?
>
> First approach:
> Each email + attachments gets one document with the following fields:
> subject, sender_address, to_address, message_text, 1_attachment_name,
> 1_attachment_content, 2_attachment_name, 2_attachment_content,
> 3_attachment_name, 3_attachment_content
> Disadvantage:
> Only three attachments could be indexed. It isn't a generic solution for
> indexing 'n' file-attachments.
>
> Second approach:
> Each email gets one document with the main email-data and 0 to n documents
> of file-attachments:
> 1 x  email_id, subject, sender_address, to_address, message_text
> 0..n x  email_id, attachment_name, attachment_content
> Disadvantage:
> At query time it is difficult to aggregate the documents that belongs to
> each other. One hit per email (including attachments) should be shown.
>
> Any thoughts?
>
> Thanks
> lude
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message