From: John Haxby
To: java-user@lucene.apache.org
Date: Wed, 16 Aug 2006 14:00:51 +0100
Subject: Re: Best Practice: emails and file-attachments
Message-ID: <44E31703.9050705@scalix.com>
References: <44E2E7C6.1000501@scalix.com>

lude wrote:
>> You also mentioned indexing each bodypart ("attachment") separately.
>> Why? ....
>> To my mind, there is no use case where it makes sense to search a
>> particular bodypart
>
> I will give you the use case:
>
> [snip]
> 3.) The result list would show this:
> 1. mail-1 'subject'
>    'Abstract of the message-text'
> 2. mail-2 'subject'
>    Attachment with name 'filename.doc' contains 'Abstract of
>    file-content'
>
> Another Use-Case would be an extended search, which allows to select if
> "attached files" should be searched (yes or no).

That's a good use case. File it as a bug and close it WONTFIX :-)

The problem that you have is trying to determine whether something is going to be inline or an attachment. I'll give you a real-life example that caught out some old code the other day. We had a message with this structure:

    multipart/alternative
        text/plain
        multipart/related
            text/html
            image/gif
            image/gif
        application/msword

Is there an attached file in there? Think before you read on.

The answer should be "no". Are you surprised that at least one client decided that there was? What we have is three representations of the same document: plain text, HTML (with two pictures) and MS Word. The original, the Word document, obviously has the best fidelity and comes last. The one client I'm thinking of (and I've lost track of which one it was) correctly suppressed the display of the text/plain alternative, displayed the HTML with its pictures in-line and then mistakenly displayed the Word document as an attachment.

This is a fictional example, but it could exist:

    multipart/related
        text/html
        image/gif
        application/msword

The gif image (and let's assume it can be indexed sensibly) is "obviously" a picture in the HTML bodypart. What's the Word document? It's referenced from the HTML as a link just like the picture is. Is it an attachment?
What's the difference between the Word document referenced as a link within the multipart/related (by content-id) and a link to an external document (by http URL)? From a user perspective both are the same, but is one an attachment and the other not?

I'm being unfair: not only is this an unrealistic problem, but there isn't a right or a wrong answer. The Word document isn't an attachment because it doesn't (or shouldn't) appear in the list of attachments, and it's not in-line because you have to click on something to see it.

So yes, I agree, your use-cases are good; I'm just not sure how you're going to identify an attachment :-)

I do like the idea, though, that when you search for "xyzzy" you get the abstract of the bodypart that contains "xyzzy" rather than the abstract (or subject) of the entire message, and I'm going to think about that one some more. The problem that immediately springs to mind, though, is that a message can have an arbitrary number of bodyparts, so if I have BODY-1, BODY-2, ..., BODY-N (where N is unknown), how hard is it for me to construct the search? I think I probably should construct the search that way, because the score depends upon the size of the document and it seems to make sense that the document is the bodypart, not the entire message, but it seems more complex than is useful for mail messages.

jch
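
For what it's worth, the multipart/alternative example above can be made concrete with a minimal sketch in Python, using only the standard-library email package. Everything named here is an assumption made for illustration: the helper names (build_example, leaves, naive_attachments, alternative_aware_attachments), the placeholder payloads, and the naive rule that "any leaf that is not text or an image is an attachment". It is not taken from any particular mail client.

```python
from email.mime.application import MIMEApplication
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_example():
    """Build a message shaped like the real-life example (payloads are placeholders)."""
    related = MIMEMultipart("related")
    related.attach(MIMEText("<p>report</p>", "html"))
    related.attach(MIMEImage(b"GIF89a", "gif"))
    related.attach(MIMEImage(b"GIF89a", "gif"))

    msg = MIMEMultipart("alternative")
    msg.attach(MIMEText("report", "plain"))
    msg.attach(related)
    msg.attach(MIMEApplication(b"placeholder bytes", "msword"))
    return msg


def leaves(part, parents=()):
    """Yield (content_type, enclosing_multipart_types) for every non-multipart part."""
    if part.is_multipart():
        for child in part.get_payload():
            yield from leaves(child, parents + (part.get_content_type(),))
    else:
        yield part.get_content_type(), parents


def naive_attachments(msg):
    """Hypothetical naive rule: any leaf that is not text/* or image/* is an attachment."""
    return [ctype for ctype, _ in leaves(msg)
            if not ctype.startswith(("text/", "image/"))]


def alternative_aware_attachments(msg):
    """Same rule, but a direct child of multipart/alternative is treated as
    another representation of the same content, not as an attachment."""
    return [ctype for ctype, parents in leaves(msg)
            if not ctype.startswith(("text/", "image/"))
            and not (parents and parents[-1] == "multipart/alternative")]


if __name__ == "__main__":
    example = build_example()
    print(naive_attachments(example))              # the Word part is flagged
    print(alternative_aware_attachments(example))  # nothing is flagged
```

The naive rule reproduces the mistake described above: it reports the Word document as an attachment. Looking at the enclosing multipart type fixes this particular case, but it does nothing for the fictional multipart/related example, where the Word part sits next to the image and there is no right or wrong answer.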