lucene-java-user mailing list archives

From Erik Hatcher <e...@ehatchersolutions.com>
Subject Re: Start/end offsets in analyzers
Date Wed, 28 Mar 2007 12:40:32 GMT

On Mar 28, 2007, at 3:51 AM, Antony Bowesman wrote:

> I'm fiddling with custom analyzers to analyze email addresses to
> store the full email address and the component parts.  It's based
> on Solr's analyzer framework, so I have a StandardTokenizerFactory
> followed by an EmailFilterFactory.  It produces
>
> Analyzing "<humphrey.bogart@casablanca.com>"
>
> 1: [humphrey.bogart@casablanca.com:1->31:<EMAIL>]
> 2: [humphrey:1->9:<EMAIL>]
> 3: [bogart:10->16:<EMAIL>]
> 4: [casablanca:17->27:<EMAIL>]
> 5: [com:28->31:<EMAIL>]
>
> I set the start/end offsets of each component to that part's own
> position and length in the text, but listing 4.6 in the LIA book
> shows the start/end offsets for the synonyms as the same as those
> of the original token.
>
> LIA says these are not used in Lucene - is that still the case for  
> 2.1 and does this matter?

They aren't used implicitly by anything in Lucene, but can be very  
handy for efficient highlighting.  Where you set the offsets really  
all depends on how you plan on using the offset values.  In the  
synonym example you mention, if the original word is "dog" and the  
user searched for "canine", to properly highlight the word "dog" in  
the original text the offsets for "canine" need to be where "dog" is.
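
To make the two choices concrete, here is a minimal sketch in plain Java. It does not use Lucene's classes: the Tok class and the highlight method are hypothetical stand-ins I made up for illustration, and I'm assuming offsets are character positions with an exclusive end, which matches the numbers in your output. The first case gives the synonym "canine" the offsets of "dog"; the second gives each email component its own offsets.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetHighlightSketch {

    /** Hypothetical stand-in for an analyzer token: term text plus character offsets. */
    public static final class Tok {
        final String term;
        final int start; // inclusive offset into the original text
        final int end;   // exclusive offset into the original text

        public Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Wraps the original-text span of the first token whose term equals
     * queryTerm in square brackets. Returns the text unchanged if no token matches.
     */
    public static String highlight(String text, List<Tok> tokens, String queryTerm) {
        for (Tok t : tokens) {
            if (t.term.equals(queryTerm)) {
                return text.substring(0, t.start)
                        + "[" + text.substring(t.start, t.end) + "]"
                        + text.substring(t.end);
            }
        }
        return text;
    }

    public static void main(String[] args) {
        // Synonym case: "canine" is injected with the same offsets as "dog".
        String text = "my dog barks";
        List<Tok> tokens = new ArrayList<>();
        tokens.add(new Tok("my", 0, 2));
        tokens.add(new Tok("dog", 3, 6));
        tokens.add(new Tok("canine", 3, 6));
        tokens.add(new Tok("barks", 7, 12));
        System.out.println(highlight(text, tokens, "canine")); // my [dog] barks

        // Email case: each component keeps its own offsets, as in the quoted output.
        String email = "<humphrey.bogart@casablanca.com>";
        List<Tok> parts = new ArrayList<>();
        parts.add(new Tok("humphrey.bogart@casablanca.com", 1, 31));
        parts.add(new Tok("humphrey", 1, 9));
        parts.add(new Tok("bogart", 10, 16));
        parts.add(new Tok("casablanca", 17, 27));
        parts.add(new Tok("com", 28, 31));
        System.out.println(highlight(email, parts, "bogart")); // <humphrey.[bogart]@casablanca.com>
    }
}
```

With per-component offsets, a search on "bogart" marks only that part of the address. If every component carried the offsets of the full address (1->31), the same search would mark the whole address instead. Which of the two you want is the design decision.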

	Erik

