lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From John Haxby <...@scalix.com>
Subject Re: Best Practice: emails and file-attachments
Date Wed, 16 Aug 2006 09:39:18 GMT
lude wrote:
> Hi John,
>
> thanks for the detailed answer.
>
> You wrote:
>> If you're indexing a
>> multipart/alternative bodypart then index all the MIME headers, but only
>> index the content of the *first* bodypart.
>
> Does this mean you index just the first file-attachment?
> What do you advice, if you have to index mulitpart bodys (== more then 
> one
> file-attachment)?
> One lucene-document for each part (==file)?
> How do you handle the queries?
MIME has no concept of "attachment", that's something that the user 
agent programs have a concept of -- you "attach" a file to a message.  
The file might be a picture, a word document, a compressed tar archive 
-- as far MIME is concerned they're all the same (well, apart from the 
content-* headers that describe what's "attached").   The MIME type for 
a message with "attachments" is "multipart".   There are several 
subtypes though.   If you're typing a plain text message (whose MIME 
type is text/plain, a message like this one) and you attach a jpeg image 
to it you'll be sending a message whose type is multipart/mixed;  the 
first part will have type text/plain and the second image/jpeg.   In 
Google Mail  under "more options" you can "show original" to see the 
complete MIME message and you'll see the different parts separated by a 
boundary.

OK.   Now I'm in a position to answer your question.   Often, when you 
send an HTML formatted message the content of the message is sent twice: 
once as text/plain and once as text/html (or multipart/related if it has 
pictures and stuff).   The two parts are alternatives, apart from the 
formatting (and pictures) there's no difference between the two parts, 
you can read either.  The best fidelity of the alternatives (and there 
can be more than two) is last, the poorest fidelity first, but the 
intent of the sender is that you can read any of them.   This is a 
multipart/alternative bodypart.   Because all parts of the 
multipart/alternative have the same text then you can index any of them, 
so index the first as that's going to be the easiest to process (it's 
almost always going to be text/plain).

I've skipped loads.   You need to read the RFCs.   Start with RFC2045 
(http://www.rfc.net/rfc2045.html) and keep going.  If you get stuck with 
the details of how messages are constructed, go back and read RFC2822 
first, or at least skim it (it's quite long).  Note that RFC2045 
references RFC822 in its abstract, where ever you see references to 
RFC821 and RFC822 you can read them as references to RFC2821 and RFC2822 
respectively -- the newer ones are a little more precise when they need 
to be and have rather more explanation of awkward cases that you need to 
know about.

Someone earlier (and I'm sorry, I deleleted the message before realising 
i should reply) said something about attached files really being in an 
attached .tar.gz file.   Well, yes and no.   An attached compressed tar 
archive is a bodypart like any other and will need to be indexed like 
any other.   That will involve breaking it open and indexing the files 
that it contains.   It's not really any different to indexing an open 
office document (which is actually a zip file).

You also mentioned indexing each bodypart ("attachment") separately.   
Why?   When I'm searching, am I going to look for the word "xyzzy" in 
the first bodypart?   What if it was a multipart/alternative and 
Thunderbird (in my case) suppressed the first bodypart and "xyzzy" is 
something that couldn't be rendered in the (first) text/plain 
alternative?   To my mind, there is no use case where it makes sense to 
search a particular bodypart.  There *might* be a case for searching the 
"prime" bodypart and "attachments" but when you read the MIME spec 
you'll realise that detecting what the user sees as an attachment is not 
easy: it gets even harder when you discover that different mail user 
agents have different and legal (and sometimes reasonable) ways of 
deciding whether to treat something as in-line or as an attachment.   To 
be honest, people don't remember whether something was an attachment.   
They think "I remember reading about xyzzy in a mail message" and go off 
looking for that.   They often can't tell and remember even less that 
the "xyzzy" was in something that you decided was an attachment.   And 
if your rules for deciding whether you have something that's intended to 
be viewed as an attachment or in-line are different to the rules that 
the  user's mail reader is using then you'll have Awkward Bugs to 
explain.   You'll read about "Content-Disposition" in the RFCs, but 
don't believe that it's a foolproof way of deciding whether or not 
something is an attachment, lack of a content-disposition header doesn't 
mean "inline" or "attachment" and Microsoft, bless, have weird rules all 
of their own for deciding whether to display something in-line or not.

jch

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message