lucene-java-user mailing list archives

From Steve Suppe <ssu...@llnl.gov>
Subject Offset Questions
Date Fri, 07 Mar 2008 18:38:43 GMT
Hi all,

I'm trying to index documents so that a) all the documents are indexed 
'normally' (so that I can search for documents that match certain words), 
and b) parts of the document that I consider important, such as author and 
title, are ALSO stored in their own indexed fields.

I have (a) working fine, and (b) is almost working. However, I'm trying to 
force the separate field to keep the original offsets of where its content 
appeared in the text.  For example, if the title was at characters 76-200 in 
the original text, I'd like the field to carry those offsets, so when I look 
at the field I can find the place in the document quickly.

I don't seem to be able to do this. I have my own analyzer that finds the 
tokens and sets the start and end offsets accordingly.  However, when I 
create the new field and write it to the index, these offsets appear to be 
ignored: when I pull the offsets out later, they start at 0 and move up 
from there.
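
To illustrate the intended behavior, here is a minimal sketch in plain Java, 
with no Lucene classes involved. The names Tok, tokenize, and baseOffset are 
hypothetical and are not part of CASAnnotationAnalyzer or any Lucene API; the 
sketch only shows token offsets being shifted by the position where the field 
value starts in the original text (76 in the example above).

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetShiftSketch {

    // Hypothetical token holder: text plus offsets into the ORIGINAL document.
    static final class Tok {
        final String text;
        final int start; // inclusive
        final int end;   // exclusive

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Splits value on whitespace and shifts every offset by baseOffset,
    // the (assumed known) position of value within the original text.
    static List<Tok> tokenize(String value, int baseOffset) {
        List<Tok> out = new ArrayList<Tok>();
        int i = 0;
        int n = value.length();
        while (i < n) {
            // Skip whitespace between tokens.
            while (i < n && Character.isWhitespace(value.charAt(i))) {
                i++;
            }
            int s = i;
            // Consume one run of non-whitespace characters.
            while (i < n && !Character.isWhitespace(value.charAt(i))) {
                i++;
            }
            if (i > s) {
                out.add(new Tok(value.substring(s, i), baseOffset + s, baseOffset + i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A title that starts at character 76 of the original text.
        for (Tok t : tokenize("Offset Questions", 76)) {
            System.out.println(t.text + " [" + t.start + "," + t.end + ")");
        }
    }
}
```

With a base offset of 76, the first token would be expected to report a start 
offset of 76 rather than 0; this is the behavior wanted from the indexed field.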

I am creating the field like:

CASAnnotationAnalyzer psa = new CASAnnotationAnalyzer();
analyzer.addAnalyzer(info.indexName, psa);

TokenStream ts = psa.tokenStream(info.indexName,
                                 new StringReader(info.value));
Field stemF = new Field(info.indexName, ts,
                        Field.TermVector.WITH_POSITIONS_OFFSETS);
d.add(stemF);

(d is the document being indexed).

I have tried various permutations of creating the field and token stream. 
Does anyone have any insights, please?

Thanks in advance,
Steve


