lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Steve Suppe <>
Subject Re: Offset Questions (Follow-Up)
Date Fri, 07 Mar 2008 19:39:23 GMT
OK, I think I understand what's going on - it looks like I am able to set 
the token for the full author name (Say, "Steve Suppe") with the correct 
offsets, but the analyzer takes it once step further and tokenizes 'Steve' 
and 'Suppe' which is giving me a lot more generated offsets and is 
confusing me.

I like the tokenization, as it allows me to just search for Suppe and get 
results.  However, I don't want those "sub-offsets" returned.  Is there a 
way to distinguish the 'main' offsets for the whole field?

Thanks again,

At 10:38 AM 3/7/2008, you wrote:
>Hi all,
>I'm trying to index documents so that a) I have all the documents indexed 
>'normally' (in that I can search for documents that match certain words, 
>and b) parts of the document that I consider important, such as author and 
>title are ALSO stored in their own indexed fields.
>I have (a) working fine, and (b) is almost working - however, I'm trying 
>to force the separate field to have the original offsets of where it 
>existed in the text.  As in, if the title was at characters 76-200 in the 
>original text, I'd like the field to have that as its information, so when 
>I look at the field I can find the place in the document quickly.
>I don't seem to be able to do this - I have my own analyzer that finds the 
>tokens and sets the start and end offsets accordingly.  However, when I 
>create the new field and write it to the index, it seems like these 
>offsets are ignored?  When I pull offsets out later, they start at 0 and 
>move up from there.
>I am creating the field like:
>CASAnnotationAnalyzer psa = new CASAnnotationAnalyzer();
>analyzer.addAnalyzer(info.indexName, psa);
>TokenStream ts = psa.tokenStream(info.indexName,
>                                              new StringReader(info.value));
>Field stemF = new Field(info.indexName, ts,
>                                     Field.TermVector.WITH_POSITIONS_OFFSETS);
>(d is the document being indexed).
>I have tried various permutations of creating the field and token stream - 
>does anyone have any insights, please?
>Thanks in advance,
>To unsubscribe, e-mail:
>For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message