lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Wechner <michael.wech...@wyona.com>
Subject Re: indexing emails
Date Mon, 19 Jun 2006 07:21:14 GMT
Rob Staveley (Tom) wrote:
> Having spent a lot of time getting this wrong myself in an e-mail
> indexer(!), I urge you to consider whether in your query interface you will
> need to look for mail to "john*" rather than john@boo.com, because "john*"
> may have been addressed to john@boo.net or john.smith@boo2.com. If you index
> only john@boo.com (untokenised) you will have to use a PrefixQuery to look
> for "john*", and you are liable to hit BooleanQuery.TooManyClauses problems,
> if you have more than 1024 (or BooleanQuery.getMaxClauseCount()) e-mail
> addresses in your index starting with "john". 
>
> I'm trying to figure out a good design for this now for my own e-mail
> indexing application,

btw, is your code available somewhere, I mean as Open Source ;-) ?

Thanks

Michi
>  considering also whether I should cater for searches
> for "*smith*". I'm coming round to the realisation that WildCardQuery and
> PrefixQuery are not great things to depend upon for getting e-mail addresses
> from an index and the right thing to do is to break the address up into
> natural tokens ('.' or '-') in one field and leave them intact in another
> field. It isn't ideal; e-mail addresses with no separator between initials
> or first names and last name still need a PrefixQuery or WildcardQuery, if
> you want to search for last names, but it does make some queries possible
> which would otherwise blow up.
>
> -----Original Message-----
> From: karl wettin [mailto:kalle@snigel.net] 
> Sent: 16 June 2006 21:13
> To: java-user@lucene.apache.org
> Subject: Re: indexing emails
>
> On Fri, 2006-06-16 at 15:20 -0400, Michael J. Prichard wrote:
>   
>> I am working on indexing emails and want to have a "to" field.  I am 
>> currently putting all the emails on one line seperated w/
>>     
> spaces...example:
>   
>> michael@foo.bar john@boo.com jane@bar.com
>>
>> Then i index that with a StandardAnalyzer as follows:
>>
>> doc.add(new Field("to", (String) itemContent.get("to"), 
>> Field.Store.YES, Field.Index.UN_TOKENIZED));
>>
>> Question is...is this the best way to do it?  I want to be able to 
>> search for michael@foo.bar and pick out just those Documents, etc.
>>     
>
> You can either do it as above (but you want to TOKENIZE the field) or you
> could create a new UN_TOKENIZED field for each email address.
>
> The second will require less CPU as it does not involve any lexical
> analysis. It will also create larger distance between the addresses in the
> index (see span queries and term positions).
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>   


-- 
Michael Wechner
Wyona      -   Open Source Content Management   -    Apache Lenya
http://www.wyona.com                      http://lenya.apache.org
michael.wechner@wyona.com                        michi@apache.org
+41 44 272 91 61


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message