lucene-solr-dev mailing list archives

From "Shalin Shekhar Mangar (JIRA)" <j...@apache.org>
Subject [jira] Updated: (SOLR-934) Enable importing of mails into a solr index through DIH.
Date Sat, 11 Apr 2009 21:47:14 GMT

     [ https://issues.apache.org/jira/browse/SOLR-934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shalin Shekhar Mangar updated SOLR-934:
---------------------------------------

    Attachment: SOLR-934.patch

Changes
# Added messageId as another field
# Added another core to example-DIH for indexing mails. When the example target is run, it
copies the Tika libs, mail.jar, activation.jar and extras.jar into the
example/example-DIH/solr/mail/lib directory.
# Added a maven pom template for extras jar
# Updated maven related targets in the main build.xml for the new pom
# Added licenses for mail.jar and activation.jar in LICENSE.txt

I'm not sure what needs to be added to NOTICE.txt. Can anybody help?

To run this:
# Apply this patch
# Create a directory called lib inside contrib/dataimporthandler
# Download mail.jar and activation.jar and add them to the above directory
# Update example/example-DIH/solr/mail/conf/data-config.xml with your mail server and login
details
# Run ant clean example
# cd example
# java -Dsolr.solr.home=./example-DIH/solr -jar start.jar
# Hit http://localhost:8983/solr/mail/dataimport?command=full-import
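
For anyone scripting the last step, here is a minimal sketch of building that request URL in Java. The class and method names (DataImportUrl, build) are hypothetical helpers and are not part of Solr or of this patch; only the host, port, core name and command come from the steps above.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Solr: builds the DIH request URL from the steps above.
class DataImportUrl {
    static String build(String host, int port, String core, String command) {
        try {
            // The multi-argument URI constructor quotes illegal characters in path and query.
            return new URI("http", null, host, port,
                    "/solr/" + core + "/dataimport", "command=" + command, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Invalid URL parts", e);
        }
    }

    public static void main(String[] args) {
        // prints http://localhost:8983/solr/mail/dataimport?command=full-import
        System.out.println(build("localhost", 8983, "mail", "full-import"));
    }
}
```

The same helper could be pointed at other DIH commands, such as status, by changing the last argument.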

I'll let people try this out before committing this in a day or two. 

This will probably need some more enhancements which can be done through additional issues.
Some that I can think of are:
# Pluggable CustomFilter implementations
# Making fields/methods inside MailEntityProcessor protected so functionality can be enhanced/overridden
# Attachments are stored in two separate fields, attachment and attachmentNames. We need a way to
associate one with the other. I recall some discussion on the LocalSolr issue about something similar
for multiple lat/long pairs.
# Enhance example configuration to be able to run a mailing list search service out-of-the-box

> Enable importing of mails into a solr index through DIH.
> --------------------------------------------------------
>
>                 Key: SOLR-934
>                 URL: https://issues.apache.org/jira/browse/SOLR-934
>             Project: Solr
>          Issue Type: New Feature
>          Components: contrib - DataImportHandler
>    Affects Versions: 1.4
>            Reporter: Preetam Rao
>            Assignee: Shalin Shekhar Mangar
>             Fix For: 1.4
>
>         Attachments: SOLR-934.patch, SOLR-934.patch, SOLR-934.patch, SOLR-934.patch, SOLR-934.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Enable importing of mails into Solr through DIH. Take one or more mailbox credentials,
> download and index their content along with the content from attachments. The folders to fetch
> can be made configurable based on various criteria. Apache Tika is used for extracting content
> from different kinds of attachments. JavaMail is used for mailbox related operations like
> fetching mails, filtering them etc.
> The basic configuration for one mailbox is as follows:
> {code:xml}
> <document>
>    <entity processor="MailEntityProcessor" user="somebody@gmail.com" 
>                 password="something" host="imap.gmail.com" protocol="imaps"/>
> </document>
> {code}
> The full list of available configuration options is below:
> {color:green}Required{color}
> ---------
> *user* 
> *password* 
> *protocol*  (only "imaps" supported now)
> *host* 
> {color:green}Optional{color}
> ---------
> *folders* - comma separated list of folders. 
> If not specified, the default folder is used. Nested folders can be specified as a/b/c.
> *recurse* - index subfolders. Defaults to true.
> *exclude* - comma separated list of patterns. 
> *include* - comma separated list of patterns.
> *batchSize* - mails to fetch at once in a given folder. 
> Only headers can be prefetched in JavaMail IMAP.
> *readTimeout* - defaults to 60000ms
> *connectTimeout* - defaults to 30000ms
> *fetchSize* - IMAP config. 32KB default
> *fetchMailsSince* - date/time in "yyyy-MM-dd HH:mm:ss" format. Only mails received after this
> time are fetched. Useful for delta import.
> *customFilter* - class name.  
> {code}
> import javax.mail.Folder;
> import javax.mail.search.SearchTerm;
> // Nested in MailEntityProcessor; the class named in customFilter implements it.
> public interface CustomFilter {
>   SearchTerm getCustomSearch(Folder folder);
> }
> {code}
> *processAttachement* - defaults to true
> The indexed fields are listed below.
> {code}
>   // Fields To Index
>   // single valued
>   private static final String SUBJECT = "subject";
>   private static final String FROM = "from";
>   private static final String SENT_DATE = "sentDate";
>   private static final String XMAILER = "xMailer";
>   // multi valued
>   private static final String TO_CC_BCC = "allTo";
>   private static final String FLAGS = "flags";
>   private static final String CONTENT = "content";
>   private static final String ATTACHMENT = "attachement";
>   private static final String ATTACHMENT_NAMES = "attachementNames";
>   // flag values
>   private static final String FLAG_ANSWERED = "answered";
>   private static final String FLAG_DELETED = "deleted";
>   private static final String FLAG_DRAFT = "draft";
>   private static final String FLAG_FLAGGED = "flagged";
>   private static final String FLAG_RECENT = "recent";
>   private static final String FLAG_SEEN = "seen";
> {code}
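
The fetchMailsSince option in the description above expects a "yyyy-MM-dd HH:mm:ss" value. As a rough sketch, such a value could be produced and checked in Java as shown here. The FetchMailsSince class and its methods are hypothetical and are not part of Solr or the patch. The description does not say which time zone applies, so this sketch assumes a plain local date-time with no zone.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Solr or the patch.
class FetchMailsSince {
    // Pattern taken from the fetchMailsSince description; time zone handling is an assumption.
    static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Formats a timestamp as a fetchMailsSince attribute value.
    static String format(LocalDateTime time) {
        return FORMAT.format(time);
    }

    // Parses a value; throws DateTimeParseException if it does not match the pattern.
    static LocalDateTime parse(String value) {
        return LocalDateTime.parse(value, FORMAT);
    }

    public static void main(String[] args) {
        LocalDateTime lastImport = LocalDateTime.of(2009, 4, 11, 21, 47, 14);
        // prints 2009-04-11 21:47:14
        System.out.println(format(lastImport));
    }
}
```

For a delta import, the time of the previous import would be formatted this way and placed in the fetchMailsSince attribute of the entity.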

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

