lucene-java-user mailing list archives

From "Michael J. Prichard" <michael_prich...@mac.com>
Subject Re: indexing emails
Date Mon, 19 Jun 2006 12:32:56 GMT
We are actually grabbing emails by becoming part of the SMTP stream.
This part is figured out, and we have archived over 600k emails into a
MySQL database.  The problem is that, since we currently store the blobs
in the DB, the databases are getting large and searching takes plenty of
time.  We want to convert the searching to Lucene to add more advanced
features.

Can I have multiple "to", "from" and "bcc" fields?

-Michael

Rob Staveley (Tom) wrote:

>>you cannot index PST files standalone
>
>You can with LibPST (a C library - see
>http://sourceforge.net/projects/ol2mbox), if they are 97-2002 format.
>
>-----Original Message-----
>From: Mike Streeton [mailto:mike.streeton@ardentia.co.uk] 
>Sent: 19 June 2006 08:33
>To: java-user@lucene.apache.org
>Subject: RE: indexing emails
>
>When you talk about indexing emails, are you indexing Outlook mails? We have
>only found a few libraries that will do this, and all require Outlook to be
>online at the time, i.e. you cannot index PST files standalone.
>
>As far as indexing goes, index each address in a separate un-tokenized field,
>not space-delimited in a single field. It is also useful to put the To, CC
>and BCC in a single field to enable you to search for email you have sent to
>a person. I would also recommend you do some processing on the Subject field
>to remove FW and RE; this will allow you to search by subject and pick up all
>emails in the thread.
>
>Mike
>
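The Subject clean-up suggested above could be sketched roughly as follows. This is plain Java with no Lucene dependency. The class name SubjectNormalizer, the set of prefixes (RE, FW, FWD) and the lower-casing step are assumptions for illustration, not something taken from this thread.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class SubjectNormalizer {
    // One or more leading reply/forward markers such as "RE:", "Fw:" or "FWD:".
    // The prefix list is an assumption; mail clients in other locales use others.
    private static final Pattern PREFIXES =
        Pattern.compile("^(?:\\s*(?:re|fw|fwd)\\s*:\\s*)+", Pattern.CASE_INSENSITIVE);

    /** Strips leading reply/forward markers, then trims and lower-cases the subject. */
    static String normalize(String subject) {
        if (subject == null) {
            return "";
        }
        return PREFIXES.matcher(subject).replaceFirst("").trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(normalize("RE: FW: indexing emails")); // prints "indexing emails"
    }
}
```

The normalized value would go into its own un-tokenized field next to the original subject, so that one exact-match lookup picks up every message in a thread.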
>-----Original Message-----
>From: Michael Wechner [mailto:michael.wechner@wyona.com]
>Sent: 19 June 2006 08:21
>To: java-user@lucene.apache.org
>Subject: Re: indexing emails
>
>Rob Staveley (Tom) wrote:
>
>>Having spent a lot of time getting this wrong myself in an e-mail
>>indexer(!), I urge you to consider whether in your query interface you
>>will need to look for mail to "john*" rather than john@boo.com, because
>>"john*" may have been addressed to john@boo.net or john.smith@boo2.com.
>>If you index only john@boo.com (untokenised) you will have to use a
>>PrefixQuery to look for "john*", and you are liable to hit
>>BooleanQuery.TooManyClauses problems, if you have more than 1024 (or
>>BooleanQuery.getMaxClauseCount()) e-mail addresses in your index
>>starting with "john".
>>
>>I'm trying to figure out a good design for this now for my own e-mail
>>indexing application,
>
>btw, is your code available somewhere, I mean as Open Source ;-) ?
>
>Thanks
>
>Michi
>
>>considering also whether I should cater for searches for "*smith*".
>>I'm coming round to the realisation that WildcardQuery and PrefixQuery
>>are not great things to depend upon for getting e-mail addresses from
>>an index and the right thing to do is to break the address up into
>>natural tokens (split at '.' or '-') in one field and leave them intact
>>in another field. It isn't ideal; e-mail addresses with no separator
>>between initials or first names and last name still need a PrefixQuery
>>or WildcardQuery, if you want to search for last names, but it does
>>make some queries possible which would otherwise blow up.
>>
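As a rough illustration of the two-field idea described above, the following plain-Java sketch derives the "natural tokens" of an address while leaving the intact form available for the exact-match field. The class name AddressTokens is hypothetical, and splitting at '@' as well as at '.' and '-' is an assumption that goes beyond what was said in the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class AddressTokens {
    /** Splits an address at '@', '.' and '-' and returns the non-empty, lower-cased pieces. */
    static List<String> tokens(String address) {
        List<String> out = new ArrayList<>();
        for (String piece : address.toLowerCase(Locale.ROOT).split("[@.\\-]+")) {
            if (!piece.isEmpty()) { // a leading separator yields an empty first piece
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String address = "John.Smith@boo2.com";
        // Value for the intact (un-tokenized) field.
        System.out.println(address.toLowerCase(Locale.ROOT));
        // Space-joined value for the tokenized field.
        System.out.println(String.join(" ", tokens(address)));
    }
}
```

With the pieces indexed as separate terms, a search for "smith" becomes a single term lookup rather than a wildcard expansion. Addresses such as jsmith@boo.com still have no separator to split on, which is the limitation noted above.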
>>-----Original Message-----
>>From: karl wettin [mailto:kalle@snigel.net]
>>Sent: 16 June 2006 21:13
>>To: java-user@lucene.apache.org
>>Subject: Re: indexing emails
>>
>>On Fri, 2006-06-16 at 15:20 -0400, Michael J. Prichard wrote:
>>
>>>I am working on indexing emails and want to have a "to" field.  I am
>>>currently putting all the emails on one line separated w/
>>>spaces...example:
>>>
>>>michael@foo.bar john@boo.com jane@bar.com
>>>
>>>Then I index that with a StandardAnalyzer as follows:
>>>
>>>doc.add(new Field("to", (String) itemContent.get("to"),
>>>Field.Store.YES, Field.Index.UN_TOKENIZED));
>>>
>>>Question is...is this the best way to do it?  I want to be able to
>>>search for michael@foo.bar and pick out just those Documents, etc.
>>
>>You can either do it as above (but you want to TOKENIZE the field) or
>>you could create a new UN_TOKENIZED field for each email address.
>>
>>The second will require less CPU as it does not involve any lexical
>>analysis. It will also create larger distance between the addresses in
>>the index (see span queries and term positions).
>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>For additional commands, e-mail: java-user-help@lucene.apache.org
>
>--
>Michael Wechner
>Wyona      -   Open Source Content Management   -    Apache Lenya
>http://www.wyona.com                      http://lenya.apache.org
>michael.wechner@wyona.com                        michi@apache.org
>+41 44 272 91 61
>

