lucene-java-user mailing list archives

From "Michael J. Prichard" <michael_prich...@mac.com>
Subject Re: indexing emails
Date Sun, 18 Jun 2006 14:24:13 GMT
This is great!  I was hoping to find some people who are dealing with 
this issue.  I am going to try to tokenize the email addresses and see 
what that does.  I am going to use a StandardAnalyzer which (if I am not 
mistaken) will keep the email address as is.  Would I still have to use 
PrefixQuery for queries such as john*?

Anyone else want to comment?

-Michael

Rob Staveley (Tom) wrote:

>Having spent a lot of time getting this wrong myself in an e-mail
>indexer(!), I urge you to consider whether in your query interface you will
>need to look for mail to "john*" rather than john@boo.com, because "john*"
>may have been addressed to john@boo.net or john.smith@boo2.com. If you index
>only john@boo.com (untokenised) you will have to use a PrefixQuery to look
>for "john*", and you are liable to hit BooleanQuery.TooManyClauses problems,
>if you have more than 1024 (or BooleanQuery.getMaxClauseCount()) e-mail
>addresses in your index starting with "john". 
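
To make the clause-limit problem concrete, here is a toy model in plain Java. It is not Lucene code: it assumes a prefix query is expanded into one clause per indexed term that starts with the prefix, and that the expansion fails once the limit is exceeded. The class name `PrefixExpansionSketch` and the method `expand` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy model of prefix expansion over a sorted term list; not Lucene code. */
final class PrefixExpansionSketch {

    /** Same number as the default limit mentioned above. */
    static final int MAX_CLAUSE_COUNT = 1024;

    /**
     * Collects one "clause" per term starting with the prefix and fails
     * as soon as more than maxClauses terms match.
     */
    static List<String> expand(SortedSet<String> terms, String prefix, int maxClauses) {
        List<String> clauses = new ArrayList<>();
        // Terms are sorted, so every match sits in one contiguous run
        // beginning at the first term >= prefix.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // past the last matching term
            }
            if (clauses.size() == maxClauses) {
                throw new IllegalStateException("more than " + maxClauses + " clauses");
            }
            clauses.add(term);
        }
        return clauses;
    }

    public static void main(String[] args) {
        SortedSet<String> terms = new TreeSet<>();
        terms.add("jane@bar.com");
        terms.add("john@boo.com");
        terms.add("john@boo.net");
        terms.add("john.smith@boo2.com");
        // Each untokenised address is a single term, so "john*" needs one clause per address.
        System.out.println(expand(terms, "john", MAX_CLAUSE_COUNT));
    }
}
```

With untokenised addresses the number of matching terms grows with the number of distinct correspondents, which is why a common first name can cross the limit.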
>
>I'm trying to figure out a good design for this now for my own e-mail
>indexing application, considering also whether I should cater for searches
>for "*smith*". I'm coming round to the realisation that WildcardQuery and
>PrefixQuery are not great things to depend upon for getting e-mail addresses
>from an index and the right thing to do is to break the address up into
>natural tokens ('.' or '-') in one field and leave them intact in another
>field. It isn't ideal; e-mail addresses with no separator between initials
>or first names and last name still need a PrefixQuery or WildcardQuery, if
>you want to search for last names, but it does make some queries possible
>which would otherwise blow up.
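
The two-field idea above could be sketched roughly as follows, in plain Java without any Lucene calls. Splitting on '@' as well as '.' and '-' is an assumption added here, and the class name `AddressTokensSketch`, the method `parts` and the field names in the comments are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Splits an e-mail address into its natural tokens; not Lucene code. */
final class AddressTokensSketch {

    /** Breaks the address at '.', '-' and '@' and drops empty pieces. */
    static List<String> parts(String address) {
        List<String> parts = new ArrayList<>();
        for (String piece : address.split("[.@-]")) {
            if (!piece.isEmpty()) {
                parts.add(piece);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String address = "john.smith@boo2.com";
        // Hypothetical field "to": the intact address, for exact matches.
        System.out.println("to       = " + address);
        // Hypothetical field "to_parts": the natural tokens, so "smith"
        // can be matched as a plain term instead of "*smith*".
        System.out.println("to_parts = " + parts(address));
        // No separator between initial and last name: "smith" is not a token here.
        System.out.println("to_parts = " + parts("jsmith@boo.com"));
    }
}
```

The last call shows the limitation mentioned above: an address like jsmith@boo.com yields no separate "smith" token, so a search for the last name still needs a wildcard.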
>
>-----Original Message-----
>From: karl wettin [mailto:kalle@snigel.net] 
>Sent: 16 June 2006 21:13
>To: java-user@lucene.apache.org
>Subject: Re: indexing emails
>
>On Fri, 2006-06-16 at 15:20 -0400, Michael J. Prichard wrote:
>
>>I am working on indexing emails and want to have a "to" field.  I am 
>>currently putting all the emails on one line separated with spaces. Example:
>>
>>michael@foo.bar john@boo.com jane@bar.com
>>
>>Then I index that with a StandardAnalyzer as follows:
>>
>>doc.add(new Field("to", (String) itemContent.get("to"), 
>>Field.Store.YES, Field.Index.UN_TOKENIZED));
>>
>>Question is...is this the best way to do it?  I want to be able to 
>>search for michael@foo.bar and pick out just those Documents, etc.
>
>You can either do it as above (but you want to TOKENIZE the field) or you
>could create a new UN_TOKENIZED field for each email address.
>
>The second will require less CPU as it does not involve any lexical
>analysis. It will also create larger distance between the addresses in the
>index (see span queries and term positions).
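
A minimal sketch of the second option, assuming the "to" value arrives as one whitespace-separated line as in the example above. The splitting is plain Java; the Lucene call appears only in a comment and reuses the form from the original post. `RecipientSplitSketch` and `addresses` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Splits a "to" line into single addresses, one per future field; not Lucene code. */
final class RecipientSplitSketch {

    /** Returns the addresses in order, ignoring extra whitespace. */
    static List<String> addresses(String toLine) {
        List<String> result = new ArrayList<>();
        for (String address : toLine.trim().split("\\s+")) {
            if (!address.isEmpty()) {
                result.add(address);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String toLine = "michael@foo.bar john@boo.com jane@bar.com";
        for (String address : addresses(toLine)) {
            // In the indexer, each address would get its own field, e.g.:
            // doc.add(new Field("to", address, Field.Store.YES, Field.Index.UN_TOKENIZED));
            System.out.println(address);
        }
    }
}
```

Because every address becomes its own untokenised term, an exact search for michael@foo.bar matches without running an analyzer over the field.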
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>For additional commands, e-mail: java-user-help@lucene.apache.org
>  
>

