lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Wechner <>
Subject Re: indexing emails
Date Mon, 19 Jun 2006 07:21:14 GMT
Rob Staveley (Tom) wrote:
> Having spent a lot of time getting this wrong myself in an e-mail
> indexer(!), I urge you to consider whether in your query interface you will
> need to look for mail to "john*" rather than, because "john*"
> may have been addressed to or If you index
> only (untokenised) you will have to use a PrefixQuery to look
> for "john*", and you are liable to hit BooleanQuery.TooManyClauses problems,
> if you have more than 1024 (or BooleanQuery.getMaxClauseCount()) e-mail
> addresses in your index starting with "john". 
> I'm trying to figure out a good design for this now for my own e-mail
> indexing application,

btw, is your code available somewhere, I mean as Open Source ;-) ?


>  considering also whether I should cater for searches
> for "*smith*". I'm coming round to the realisation that WildCardQuery and
> PrefixQuery are not great things to depend upon for getting e-mail addresses
> from an index and the right thing to do is to break the address up into
> natural tokens ('.' or '-') in one field and leave them intact in another
> field. It isn't ideal; e-mail addresses with no separator between initials
> or first names and last name still need a PrefixQuery or WildcardQuery, if
> you want to search for last names, but it does make some queries possible
> which would otherwise blow up.
> -----Original Message-----
> From: karl wettin [] 
> Sent: 16 June 2006 21:13
> To:
> Subject: Re: indexing emails
> On Fri, 2006-06-16 at 15:20 -0400, Michael J. Prichard wrote:
>> I am working on indexing emails and want to have a "to" field.  I am 
>> currently putting all the emails on one line seperated w/
> spaces...example:
>> Then i index that with a StandardAnalyzer as follows:
>> doc.add(new Field("to", (String) itemContent.get("to"), 
>> Field.Store.YES, Field.Index.UN_TOKENIZED));
>> Question this the best way to do it?  I want to be able to 
>> search for and pick out just those Documents, etc.
> You can either do it as above (but you want to TOKENIZE the field) or you
> could create a new UN_TOKENIZED field for each email address.
> The second will require less CPU as it does not involve any lexical
> analysis. It will also create larger distance between the addresses in the
> index (see span queries and term positions).
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya                          
+41 44 272 91 61

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message