lucene-java-user mailing list archives

From John Haxby <...@scalix.com>
Subject Re: Best Practice: emails and file-attachments
Date Wed, 16 Aug 2006 13:44:20 GMT

Oh rats. Thunderbird ate the indenting. The two examples should be:

multipart/alternative
	text/plain
	multipart/related
		text/html
		image/gif
		image/gif
	application/msword

and

multipart/related
	text/html
	image/gif
	application/msword

The indenting indicates nesting. A message isn't just a bodypart 
followed by attachments; it has structure, like a file system. That is 
something which escapes most mail readers. Sigh.
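
To make the nesting concrete, here is a minimal sketch that rebuilds the first example with Python's standard `email` package and prints one tab-indented line per bodypart. The helper names `outline` and `build_example` and the placeholder payloads are assumptions made for illustration; they are not taken from any real message.

```python
from email.mime.application import MIMEApplication
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def outline(part, depth=0):
    """Return one tab-indented content type per bodypart, depth-first."""
    lines = ["\t" * depth + part.get_content_type()]
    if part.is_multipart():
        # For a multipart container, get_payload() is the list of child parts.
        for child in part.get_payload():
            lines.extend(outline(child, depth + 1))
    return lines


def build_example():
    """Build the multipart/alternative example (payloads are placeholders)."""
    related = MIMEMultipart("related")
    related.attach(MIMEText("<p>hello</p>", "html"))
    related.attach(MIMEImage(b"GIF89a", "gif"))
    related.attach(MIMEImage(b"GIF89a", "gif"))

    root = MIMEMultipart("alternative")
    root.attach(MIMEText("hello", "plain"))
    root.attach(related)
    root.attach(MIMEApplication(b"placeholder", "msword"))
    return root


if __name__ == "__main__":
    print("\n".join(outline(build_example())))
```

The recursion mirrors the structure: a multipart bodypart is a directory, everything else is a file.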


John Haxby wrote:
> lude wrote:
>>> You also mentioned indexing each bodypart ("attachment") separately.
>>> Why? ....
>>> To my mind, there is no use case where it makes sense to search a 
>>> particular bodypart
>>
>> I will give you the use case:
>>
>> [snip]
>> 3.) The result list would show this:
>> 1. mail-1 'subject'
>> 'Abstract of the message-text'
>> 2. mail-2 'subject'
>> Attachment with name 'filename.doc' contains 'Abstract of
>> file-content'
>>
>> Another use case would be an extended search that lets the user choose
>> whether "attached files" should be searched (yes or no).
>
> That's a good use case. File it as a bug and close it WONTFIX :-) The 
> problem that you have is trying to determine whether something is 
> going to be inline or an attachment. I'll give you a real-life example 
> that caught out some old code the other day. We had a message with 
> this structure:
>
> multipart/alternative
> text/plain
> multipart/related
> text/html
> image/gif
> image/gif
> application/msword
>
> Is there an attached file in there? Think before you read on.
>
>
>
>
>
>
> The answer should be "no". Are you surprised that at least one client 
> decided that there was? What we have is three representations of the 
> same document: plain text, HTML (with two pictures) and MS Word. The 
> original, the Word document, obviously has the best fidelity and comes 
> last. The one client I'm thinking of (and I've lost track of which one 
> it was) correctly suppressed the display of the text/plain 
> alternative, displayed the HTML with its pictures in-line and then 
> mistakenly displayed the Word document as an attachment.
>
> This is a fictional example, but it could exist:
>
> multipart/related
> text/html
> image/gif
> application/msword
>
> The GIF image (and let's assume it can be indexed sensibly) is 
> "obviously" a picture in the HTML bodypart. What's the Word document? 
> It's referenced from the HTML as a link, just like the picture is. Is 
> it an attachment? What's the difference between the Word document 
> referenced as a link within the multipart/related (by content-id) and 
> a link to an external document (by http URL)? From a user perspective 
> both are the same, but is one an attachment and the other not? I'm 
> being unfair: not only is this an unrealistic problem, but there isn't 
> a right or a wrong answer. The Word document isn't an attachment 
> because it doesn't (or shouldn't) appear in the list of attachments, 
> and it's not in-line because you have to click on something to see it.
>
> So yes, I agree, your use-cases are good; I'm just not sure how you're 
> going to identify an attachment :-)
>
> I do like the idea, though, that when you search for "xyzzy" you get 
> the abstract of the bodypart that contains "xyzzy" rather than the 
> abstract (or subject) of the entire message, and I'm going to think 
> about that one some more. The problem that immediately springs to mind 
> is that a message can have an arbitrary number of bodyparts, so if I 
> have BODY-1, BODY-2, ..., BODY-N (where N is unknown), how hard is it 
> for me to construct the search? I think I probably should construct 
> the search that way, because the score depends upon the size of the 
> document and it seems to make sense that the document is the bodypart, 
> not the entire message. Still, it seems more complex than is useful 
> for mail messages.
>
> jch
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
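
One way to sidestep the BODY-1 ... BODY-N question from the quoted message is to treat every leaf bodypart as its own document, so N never has to be known up front. The sketch below is only an illustration of that idea under some assumptions: the record layout, the dotted part paths, and the helpers `bodypart_records`, `search` and `build_mail` are all hypothetical, and the substring match is a stand-in for a real index lookup, not Lucene scoring.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def bodypart_records(part, message_id, path=""):
    """Yield one record per leaf bodypart, tagged with its owning message."""
    if part.is_multipart():
        for i, child in enumerate(part.get_payload(), start=1):
            child_path = f"{path}.{i}" if path else str(i)
            yield from bodypart_records(child, message_id, child_path)
        return
    text = ""
    if part.get_content_maintype() == "text":
        # Only text/* parts are decoded here; other types would need a
        # format-specific extractor (not shown).
        raw = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "ascii"
        text = raw.decode(charset, errors="replace")
    yield {
        "message": message_id,
        "part": path or "1",
        "type": part.get_content_type(),
        "text": text,
    }


def search(records, term):
    """Naive substring match standing in for a real index query."""
    return [(r["message"], r["part"], r["type"])
            for r in records if term in r["text"]]


def build_mail():
    """A small made-up message: plain text plus an HTML alternative."""
    related = MIMEMultipart("related")
    related.attach(MIMEText("<p>the magic word is xyzzy</p>", "html"))
    root = MIMEMultipart("alternative")
    root.attach(MIMEText("nothing to see here", "plain"))
    root.attach(related)
    return root


if __name__ == "__main__":
    records = list(bodypart_records(build_mail(), "mail-1"))
    print(search(records, "xyzzy"))
```

Each hit carries the message identifier, so a result list can still be grouped per message while the abstract comes from the bodypart that matched.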

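
The fictional multipart/related case in the quoted message can be sketched as well. The code below assumes the HTML is the first part of the multipart/related (the default root when no `start` parameter is given) and that `cid:` URLs map to Content-ID values by adding angle brackets, ignoring URL escaping. The Content-ID values and the helpers `cid_referenced_parts` and `build_related` are made up for illustration.

```python
import re
from email.mime.application import MIMEApplication
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Matches cid: URLs up to the closing quote, whitespace or '>'.
CID_REF = re.compile(r"cid:([^\"'\s>]+)", re.IGNORECASE)


def cid_referenced_parts(related):
    """Content types of sibling parts that the root HTML references by cid."""
    parts = related.get_payload()
    html = parts[0].get_payload(decode=True).decode("ascii", errors="replace")
    wanted = {"<" + cid + ">" for cid in CID_REF.findall(html)}
    return [p.get_content_type()
            for p in parts[1:] if p.get("Content-ID") in wanted]


def build_related():
    """HTML that embeds a picture and links to a Word document, both by cid."""
    html = MIMEText(
        '<img src="cid:pic1@example.invalid"> '
        '<a href="cid:doc1@example.invalid">report</a>',
        "html",
    )
    gif = MIMEImage(b"GIF89a", "gif")
    gif["Content-ID"] = "<pic1@example.invalid>"
    doc = MIMEApplication(b"placeholder", "msword")
    doc["Content-ID"] = "<doc1@example.invalid>"

    related = MIMEMultipart("related")
    for part in (html, gif, doc):
        related.attach(part)
    return related


if __name__ == "__main__":
    print(cid_referenced_parts(build_related()))
```

Both the picture and the Word document come back as "referenced", which is exactly the difficulty: a content-id reference on its own does not say whether a part is in-line or an attachment.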


