lucene-dev mailing list archives

From Lance Norskog
Subject Re: Solr DIH: Proposal for MailEntityprocessor option change
Date Fri, 19 Nov 2010 03:05:37 GMT

It might be nice to have a separate internal database of file ->
mimetype, and then just specify mimetypes?
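
A minimal sketch of that idea in Java, assuming a small hard-coded
extension -> mimetype table and a comma-separated list of allowed
mimetypes. The class name AttachmentMimeFilter, its methods, and the
table contents are hypothetical and are not part of MailEntityProcessor.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: not part of Solr or DIH.
final class AttachmentMimeFilter {
    // Assumed, deliberately tiny extension -> MIME type table.
    private static final Map<String, String> MIME_BY_EXTENSION = new HashMap<>();
    static {
        MIME_BY_EXTENSION.put("pdf", "application/pdf");
        MIME_BY_EXTENSION.put("txt", "text/plain");
        MIME_BY_EXTENSION.put("gz", "application/gzip");
        MIME_BY_EXTENSION.put("xls", "application/vnd.ms-excel");
        MIME_BY_EXTENSION.put("gif", "image/gif");
        MIME_BY_EXTENSION.put("jpg", "image/jpeg");
        MIME_BY_EXTENSION.put("png", "image/png");
    }

    private final Set<String> allowedMimeTypes = new HashSet<>();

    // Takes a comma-separated list such as "application/pdf, text/plain".
    AttachmentMimeFilter(String commaSeparatedMimeTypes) {
        for (String type : commaSeparatedMimeTypes.split(",")) {
            String trimmed = type.trim().toLowerCase(Locale.ROOT);
            if (!trimmed.isEmpty()) {
                allowedMimeTypes.add(trimmed);
            }
        }
    }

    // Returns the MIME type for the file's extension, or null if unknown.
    static String mimeTypeOf(String filename) {
        int dot = filename.lastIndexOf('.');
        if (dot < 0 || dot == filename.length() - 1) {
            return null;
        }
        String extension = filename.substring(dot + 1).toLowerCase(Locale.ROOT);
        return MIME_BY_EXTENSION.get(extension);
    }

    // An attachment is processed only if its MIME type is known and allowed.
    boolean shouldProcess(String filename) {
        String mimeType = mimeTypeOf(filename);
        return mimeType != null && allowedMimeTypes.contains(mimeType);
    }
}
```

With new AttachmentMimeFilter("application/pdf, text/plain"), a file
named notes.txt would be processed and logo.png would be skipped. In
this sketch, files with an unknown extension are skipped as well; that
is a design choice, not something the proposal specifies.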

Also, it does not download and save the 'in-reply-to' header. Without
it you cannot reconstruct mail threads.
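
To illustrate why the header matters, here is a rough sketch of thread
reconstruction from Message-ID / In-Reply-To pairs. It assumes the two
header values have already been extracted into a plain map; the class
MailThreadSketch and its methods are hypothetical names, and real
threading would usually consult the References header too.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: not part of Solr, DIH, or Tika.
final class MailThreadSketch {
    // Input maps each Message-ID to its In-Reply-To value (null for a thread starter).
    // Output maps each parent Message-ID to the IDs of its direct replies.
    static Map<String, List<String>> repliesByParent(Map<String, String> inReplyToById) {
        Map<String, List<String>> replies = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : inReplyToById.entrySet()) {
            String parent = entry.getValue();
            if (parent != null) {
                replies.computeIfAbsent(parent, key -> new ArrayList<>()).add(entry.getKey());
            }
        }
        return replies;
    }

    // Follows In-Reply-To links upward until a message has no known parent.
    static String threadRoot(String messageId, Map<String, String> inReplyToById) {
        Set<String> seen = new HashSet<>();
        String current = messageId;
        // seen.add returns false on a revisit, which guards against header cycles.
        while (seen.add(current)) {
            String parent = inReplyToById.get(current);
            if (parent == null || !inReplyToById.containsKey(parent)) {
                return current; // topmost message we actually have
            }
            current = parent;
        }
        return current;
    }
}
```

If In-Reply-To is never stored, every value in the input map is null,
and each message ends up as its own single-message thread.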

Tika also has a mailbox file parser. It would be great if both created
the exact same document. I don't know if they do now.

On Thu, Nov 18, 2010 at 8:46 AM, Peter Sturge wrote:
> Hi Solr folks,
> I admit I'm new to DIH, so I thought I'd put this out there before
> generating a Jira issue:
> I've been doing some work with importing emails using the truly
> fabulous MailEntityProcessor. Fantastic!
> I have noticed, however, that in order to retrieve email content into
> the Solr 'content' field, the processAttachement="true" attribute
> must be set on the <entity>.
> While, strictly speaking in the mime world, the content is a body
> part, I'm sure I'm not the only one with a use case of wanting to have
> the content, but not [necessarily] the attachments.
> In the code, the content processing comes *after*
> the check for processAttachement="true".
> What I propose is this:
> 1. Add a new [optional] boolean property called: includeContent. If
> 'true' the content field would be populated with the (non-attachment)
> content of the message. If 'false', the content is not included.
> 'processAttachement' would behave the same as it does now, but only
> for attachments, not text content. I would propose that
> includeContent="true" be the default behaviour.
> 2. Add an additional attribute called 'processAttachments'
> that is a synonym for the misspelled and singular
> 'processAttachement'. processAttachement would remain for backward
> compatibility.
> 3. It could be nice to have a built-in 'attachmentsPassthrough' and/or
> 'attachmentsFilter' attribute so that only matching attachment
> filenames would be processed (e.g.
> attachmentsPassthrough="*.gz,*.xls,*.pdf,*.txt"
> attachmentsFilter="*.gif,*.jpg,*.png").
>    Tika can spend a fair amount of time churning through attachments,
> and if for example, there's a lot of graphics files attached, it would
> be more efficient to simply skip them if configured to do so.
>    It would be good to hear others' thoughts on this one.
> Comments, thoughts, please?
> Thanks,
> Peter
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:
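
A rough sketch of how item 3 of the quoted proposal might work, using
the attribute names and example values from the message. The class
AttachmentNameFilter is hypothetical, and two behaviours are assumptions
that the proposal does not spell out: attachmentsFilter wins when both
lists match, and an empty attachmentsPassthrough lets every name
through. Matching is case-insensitive here, which is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: not part of MailEntityProcessor.
final class AttachmentNameFilter {
    private final List<Pattern> passthrough;
    private final List<Pattern> filter;

    // Both arguments are comma-separated globs, e.g. "*.gz,*.xls,*.pdf,*.txt".
    AttachmentNameFilter(String attachmentsPassthrough, String attachmentsFilter) {
        this.passthrough = compile(attachmentsPassthrough);
        this.filter = compile(attachmentsFilter);
    }

    // Assumption: the filter list takes precedence over the passthrough list.
    boolean shouldProcess(String filename) {
        if (matchesAny(filter, filename)) {
            return false;
        }
        return passthrough.isEmpty() || matchesAny(passthrough, filename);
    }

    // Translates each glob into a regex: '*' -> ".*", '?' -> ".", the rest literal.
    private static List<Pattern> compile(String commaSeparatedGlobs) {
        List<Pattern> patterns = new ArrayList<>();
        if (commaSeparatedGlobs == null) {
            return patterns;
        }
        for (String glob : commaSeparatedGlobs.split(",")) {
            String trimmed = glob.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            StringBuilder regex = new StringBuilder();
            for (char c : trimmed.toCharArray()) {
                if (c == '*') {
                    regex.append(".*");
                } else if (c == '?') {
                    regex.append('.');
                } else {
                    regex.append(Pattern.quote(String.valueOf(c)));
                }
            }
            patterns.add(Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE));
        }
        return patterns;
    }

    private static boolean matchesAny(List<Pattern> patterns, String filename) {
        for (Pattern pattern : patterns) {
            if (pattern.matcher(filename).matches()) {
                return true;
            }
        }
        return false;
    }
}
```

The point of checking names before parsing is that a skipped attachment
never reaches Tika at all, which is where the quoted message expects the
time to be saved.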

Lance Norskog
