lucene-java-user mailing list archives

From Antony Bowesman <...@teamware.com>
Subject Start/end offsets in analyzers
Date Wed, 28 Mar 2007 07:51:33 GMT
I'm experimenting with custom analyzers that analyze email addresses so that both
the full address and its component parts are stored.  It's based on Solr's analyzer
framework, so I have a StandardTokenizerFactory followed by an EmailFilterFactory.
It produces:

Analyzing "<humphrey.bogart@casablanca.com>"

1: [humphrey.bogart@casablanca.com:1->31:<EMAIL>]
2: [humphrey:1->9:<EMAIL>]
3: [bogart:10->16:<EMAIL>]
4: [casablanca:17->27:<EMAIL>]
5: [com:28->31:<EMAIL>]

I set each part's start/end offsets to that part's actual position and length
within the input.  However, listing 4.6 in the LIA book shows synonyms with the
same start/end offsets as the original token, so the two approaches differ.
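
To make the offset arithmetic concrete, here is a minimal sketch in plain Java of how the per-part offsets in the output above can be computed. It deliberately uses no Lucene or Solr classes: `EmailOffsets`, `Part`, and `split` are hypothetical names invented for illustration, and splitting on '.' and '@' is an assumption about what the filter does, not the actual EmailFilterFactory code.

```java
import java.util.ArrayList;
import java.util.List;

public class EmailOffsets {

    /** Hypothetical stand-in for a token: its text plus start/end character offsets. */
    static final class Part {
        final String text;
        final int start;
        final int end;

        Part(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + text + ":" + start + "->" + end + "]";
        }
    }

    /**
     * Emits the full address first, then each part delimited by '.' or '@'.
     * emailStart is the offset of the address's first character in the
     * original input (1 in the example, because of the leading '<').
     */
    static List<Part> split(String email, int emailStart) {
        List<Part> parts = new ArrayList<>();
        parts.add(new Part(email, emailStart, emailStart + email.length()));

        int begin = 0; // index within email where the current part starts
        for (int i = 0; i <= email.length(); i++) {
            boolean atEnd = (i == email.length());
            if (atEnd || email.charAt(i) == '.' || email.charAt(i) == '@') {
                if (i > begin) {
                    // Offsets are relative to the original input, not to the address.
                    parts.add(new Part(email.substring(begin, i),
                                       emailStart + begin, emailStart + i));
                }
                begin = i + 1; // skip the delimiter
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        for (Part p : split("humphrey.bogart@casablanca.com", 1)) {
            System.out.println(p);
        }
    }
}
```

Run against the example address with a starting offset of 1, this yields the same five start/end pairs as listed above. The alternative shown in LIA listing 4.6 would instead give every part the offsets of the full token (1->31 here).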

LIA says these offsets are not used by Lucene itself.  Is that still the case in
2.1, and does the difference matter?

Thanks
Antony



