lucene-dev mailing list archives

From Peter Sturge
Subject Solr DIH: Proposal for MailEntityProcessor option change
Date Thu, 18 Nov 2010 16:46:20 GMT
Hi Solr folks,

I admit I'm new to DIH, so I thought I'd put this out there before
generating a Jira issue:

I've been doing some work with importing emails using the truly
fabulous MailEntityProcessor. Fantastic!

I have noticed, however, that in order to get the email content into
the Solr 'content' field, the processAttachement="true" attribute must
be set on the <entity>.
While, strictly speaking, in the MIME world the content is a body
part, I'm sure I'm not the only one with a use case of wanting to have
the content, but not [necessarily] the attachments.
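
For reference, here is a minimal sketch of the kind of data-config.xml entity this currently requires. The entity name, account, password, host and folder values are placeholders I made up for illustration, not taken from a real setup:

```xml
<dataConfig>
  <document>
    <!-- Placeholder credentials and host; substitute your own mailbox. -->
    <entity name="mail"
            processor="MailEntityProcessor"
            user="someone@example.com"
            password="secret"
            host="imap.example.com"
            protocol="imaps"
            folders="INBOX"
            processAttachement="true"/>
    <!-- processAttachement="true" is what currently gets the body text
         into 'content'; attachments are processed along with it. -->
  </document>
</dataConfig>
```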

In the code, the content processing happens *after* the check for
processAttachement="true", so the body text is skipped whenever
attachment processing is turned off.

What I propose is this:

1. Add a new [optional] boolean property called: includeContent. If
'true' the content field would be populated with the (non-attachment)
content of the message. If 'false', the content is not included.
'processAttachement' would behave the same as it does now, but only
for attachments, not text content. I would propose that
includeContent="true" be the default behaviour.
2. Add an additional attribute called 'processAttachments'
that is a synonym for the misspelled and singular
'processAttachement'. processAttachement would remain for backward
compatibility.
3. It could be nice to have a built-in 'attachmentsPassthrough' and/or
'attachmentsFilter' attribute so that only matching attachment
filenames would be processed (e.g.
Tika can spend a fair amount of time churning through attachments,
and if, for example, there are a lot of graphics files attached, it would
be more efficient to simply skip them if configured to do so.
It would be good to hear others' thoughts on this one.
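
To make the three points concrete, here is a rough sketch of how an entity might look if the proposal were adopted. None of includeContent, processAttachments or attachmentsFilter exists today; they are the proposed names from the list above, and the filter value and its pattern syntax are purely an assumption for illustration. The connection values are placeholders.

```xml
<dataConfig>
  <document>
    <entity name="mail"
            processor="MailEntityProcessor"
            user="someone@example.com"
            password="secret"
            host="imap.example.com"
            protocol="imaps"
            folders="INBOX"
            includeContent="true"
            processAttachments="true"
            attachmentsFilter=".*\.(pdf|doc)"/>
    <!-- PROPOSED, not implemented:
         includeContent: populate 'content' with the non-attachment body.
         processAttachments: synonym for the existing processAttachement.
         attachmentsFilter: hypothetical filename pattern; only matching
         attachments would be handed to Tika. -->
  </document>
</dataConfig>
```

With includeContent="true" and processAttachments="false", the intent would be to index the message body while skipping attachments entirely.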

Comments, thoughts, please?

