lucene-java-user mailing list archives

From "Michael J. Prichard" <>
Subject Re: indexing emails
Date Sun, 18 Jun 2006 14:24:13 GMT
This is great!  I was hoping to find some people who are dealing with 
this issue.  I am going to try to tokenize the email addresses and see 
what that does.  I am going to use a StandardAnalyzer which (if I am not 
mistaken) will keep the email address as is.  Would I still have to use 
PrefixQuery for queries such as john*?

Anyone else want to comment?


Rob Staveley (Tom) wrote:

>Having spent a lot of time getting this wrong myself in an e-mail
>indexer(!), I urge you to consider whether in your query interface you will
>need to look for mail to "john*" rather than, because "john*"
>may have been addressed to or If you index
>only (untokenised) you will have to use a PrefixQuery to look
>for "john*", and you are liable to hit BooleanQuery.TooManyClauses problems,
>if you have more than 1024 (or BooleanQuery.getMaxClauseCount()) e-mail
>addresses in your index starting with "john". 
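To make the clause-limit problem above concrete, here is a rough sketch in plain Java. It is not Lucene code: it only models the idea that a prefix search is expanded into one clause per matching term, and that the expansion fails once the number of matches passes a limit. The class name, the exception and the example.com addresses are all made up for illustration; 1024 is the default limit mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

/** Toy model of prefix expansion over a sorted term dictionary; not Lucene code. */
final class PrefixExpansionSketch {
    /** Mirrors the default clause limit of 1024 mentioned in the thread. */
    static final int MAX_CLAUSES = 1024;

    /** Hypothetical stand-in for the "too many clauses" failure. */
    static final class TooManyClausesException extends RuntimeException {
        TooManyClausesException(String message) {
            super(message);
        }
    }

    /**
     * Collects every term that starts with the prefix, one "clause" per term.
     * Throws as soon as the matches would exceed maxClauses.
     */
    static List<String> expand(SortedSet<String> terms, String prefix, int maxClauses) {
        List<String> clauses = new ArrayList<String>();
        // In a sorted set, all terms sharing a prefix sit next to each other,
        // starting at the first term >= prefix.
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // walked past the last matching term
            }
            if (clauses.size() == maxClauses) {
                throw new TooManyClausesException(
                        "prefix '" + prefix + "' matches more than " + maxClauses + " terms");
            }
            clauses.add(term);
        }
        return clauses;
    }
}
```

With a handful of addresses starting with "john" the expansion is harmless; with 1025 of them and the default limit it throws, which is the failure mode described above.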
>I'm trying to figure out a good design for this now for my own e-mail
>indexing application, considering also whether I should cater for searches
>for "*smith*". I'm coming round to the realisation that WildcardQuery and
>PrefixQuery are not great things to depend upon for getting e-mail addresses
>from an index and the right thing to do is to break the address up into
>natural tokens ('.' or '-') in one field and leave them intact in another
>field. It isn't ideal; e-mail addresses with no separator between initials
>or first names and last name still need a PrefixQuery or WildcardQuery, if
>you want to search for last names, but it does make some queries possible
>which would otherwise blow up.
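As a rough illustration of the two-field idea above, the following plain-Java sketch breaks an address into its natural tokens. It is only an assumption about how the splitting might be done: the class name is hypothetical, the example.com addresses are invented, and treating '@' as a separator alongside '.' and '-' is my own addition. The intact address would go unchanged into a second, untokenized field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: splits an e-mail address into "natural" tokens. */
final class AddressTokens {
    /**
     * Lower-cases the address and splits it on '.', '-' and '@'.
     * Treating '@' as a separator is an assumption, not something
     * stated in the thread.
     */
    static List<String> split(String address) {
        List<String> tokens = new ArrayList<String>();
        for (String part : address.toLowerCase(Locale.ROOT).split("[.@-]+")) {
            if (!part.isEmpty()) { // a leading separator yields an empty first part
                tokens.add(part);
            }
        }
        return tokens;
    }
}
```

An address such as john.smith@example.com yields the tokens john, smith, example and com, so a plain term search for "john" or "smith" finds it with no prefix expansion. An address such as johnsmith@example.com still yields johnsmith as one token, which is the remaining case noted above where a PrefixQuery or WildcardQuery is still needed.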
>-----Original Message-----
>From: karl wettin [] 
>Sent: 16 June 2006 21:13
>Subject: Re: indexing emails
>On Fri, 2006-06-16 at 15:20 -0400, Michael J. Prichard wrote:
>>I am working on indexing emails and want to have a "to" field.  I am 
>>currently putting all the emails on one line separated w/
>>Then I index that with a StandardAnalyzer as follows:
>>doc.add(new Field("to", (String) itemContent.get("to"), 
>>Field.Store.YES, Field.Index.UN_TOKENIZED));
>>Question: is this the best way to do it?  I want to be able to 
>>search for and pick out just those Documents, etc.
>You can either do it as above (but you want to TOKENIZE the field) or you
>could create a new UN_TOKENIZED field for each email address.
>The second will require less CPU as it does not involve any lexical
>analysis. It will also create larger distance between the addresses in the
>index (see span queries and term positions).
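A minimal sketch of the second option above (one UN_TOKENIZED field per address), in plain Java. The separator used in the original "to" line is not shown in the thread, so it is passed in as a parameter here; the class name and the comma in the usage comment are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: turns one "to" line into individual addresses. */
final class RecipientSplitter {
    /**
     * Splits the line on the given separator (a regular expression),
     * trims each piece and drops empty ones. A null line gives an empty list.
     */
    static List<String> split(String toLine, String separatorRegex) {
        List<String> addresses = new ArrayList<String>();
        if (toLine == null) {
            return addresses;
        }
        for (String piece : toLine.split(separatorRegex)) {
            String address = piece.trim();
            if (!address.isEmpty()) {
                addresses.add(address);
            }
        }
        return addresses;
    }
}

// Possible use with the Field constructor from the original post
// (the "," separator is an assumption):
//
//   for (String address : RecipientSplitter.split((String) itemContent.get("to"), ",")) {
//       doc.add(new Field("to", address, Field.Store.YES, Field.Index.UN_TOKENIZED));
//   }
```

Each address then becomes its own term in the "to" field, so an exact-match search on a full address needs no analysis at index time.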