lucene-java-user mailing list archives

From Steve Suppe <ssu...@llnl.gov>
Subject Re: Offset Questions
Date Fri, 07 Mar 2008 21:51:35 GMT
Hi Erick,

Thanks for the response.  I think I'm starting to get the hang of
this.  That's a really good insight, but I'm wondering how to handle the
case where a document has multiple instances of the same field: for
example, instead of Author, the City names mentioned in the text.  But, as
you said, I control everything, so I may be able to work this out...
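
One way the multiple-instance case might be handled is sketched below. It assumes each instance (for example, each city mention) is known together with the position where it starts in the original text, and shifts every token offset by that base. The class and method names are hypothetical and nothing here is Lucene API; it only illustrates the arithmetic a custom analyzer could apply when it sets start and end offsets.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: maps offsets that are relative
// to one field value back to offsets in the original document text.
class MultiValueOffsets {

    /**
     * Splits value on whitespace and returns {start, end} pairs for each
     * token, shifted by base (where value begins in the original text).
     * Offsets are zero-based; end is exclusive.
     */
    static List<int[]> shiftedOffsets(String value, int base) {
        List<int[]> out = new ArrayList<>();
        int i = 0;
        while (i < value.length()) {
            // Skip whitespace between tokens.
            while (i < value.length() && Character.isWhitespace(value.charAt(i))) i++;
            int start = i;
            // Consume one token.
            while (i < value.length() && !Character.isWhitespace(value.charAt(i))) i++;
            if (i > start) out.add(new int[] {base + start, base + i});
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Flights from Livermore to New York";
        // Two instances of the same (hypothetical) "city" field.
        String[] cities = {"Livermore", "New York"};
        for (String city : cities) {
            int base = text.indexOf(city); // assumed to come from the annotator
            for (int[] span : shiftedOffsets(city, base)) {
                System.out.println(text.substring(span[0], span[1])
                        + " " + span[0] + "-" + span[1]);
            }
        }
    }
}
```

Each instance keeps its own base, so the offsets of the second city do not restart at 0.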

Still thinking :)  Thanks so much so far!

Steve

At 12:44 PM 3/7/2008, you wrote:
>What is your analyzer doing? Let's assume you're trying
>to index the title and that your entire text is
>
>"this is a book and HERE IS THE TITLE."
>
>I *think* your underlying analyzer should be returning
>4 tokens with starts of 20 for HERE, 25 for IS,
>28 for THE and 32 for TITLE, with appropriate ends.
>Is that what's happening? And perhaps
>
>If the value you're passing in to the analyzer is just the
>title and not the entire text, what you report seems
>perfectly reasonable to me....
>
>But I haven't worked with this very much so take
>this with the appropriate grain of salt...
>
>Best
>Erick
>
>
>On Fri, Mar 7, 2008 at 1:38 PM, Steve Suppe <ssuppe@llnl.gov> wrote:
>
> > Hi all,
> >
> > I'm trying to index documents so that a) I have all the documents indexed
> > 'normally' (in that I can search for documents that match certain words),
> > and b) parts of the document that I consider important, such as author and
> > title, are ALSO stored in their own indexed fields.
> >
> > I have (a) working fine, and (b) is almost working - however, I'm trying to
> > force the separate field to have the original offsets of where it existed
> > in the text.  As in, if the title was at characters 76-200 in the original
> > text, I'd like the field to have that as its information, so when I look at
> > the field I can find the place in the document quickly.
> >
> > I don't seem to be able to do this - I have my own analyzer that finds the
> > tokens and sets the start and end offsets accordingly.  However, when I
> > create the new field and write it to the index, it seems like these offsets
> > are ignored?  When I pull offsets out later, they start at 0 and move up
> > from there.
> >
> > I am creating the field like:
> >
> > CASAnnotationAnalyzer psa = new CASAnnotationAnalyzer();
> > analyzer.addAnalyzer(info.indexName, psa);
> >
> > TokenStream ts = psa.tokenStream(info.indexName,
> >                                  new StringReader(info.value));
> > Field stemF = new Field(info.indexName, ts,
> >                         Field.TermVector.WITH_POSITIONS_OFFSETS);
> > d.add(stemF);
> >
> > (d is the document being indexed).
> >
> > I have tried various permutations of creating the field and token stream -
> > does anyone have any insights, please?
> >
> > Thanks in advance,
> > Steve
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
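
The two cases in the quoted reply (analyzing the entire text versus analyzing just the title) can be compared with a small sketch. It uses a hypothetical whitespace tokenizer written in plain Java, not a Lucene analyzer, so punctuation stays attached to the last token. Counting characters from zero, the title tokens in the full text start at 19, 24, 27 and 31 (one lower than the 1-based figures in the reply), while the same tokens start at 0, 5, 8 and 12 when only the title string is passed in.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: whitespace-splits a string and
// records zero-based [start, end) character offsets for each token.
class OffsetSketch {

    /** A token with its start (inclusive) and end (exclusive) offsets. */
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + "[" + start + "," + end + ")";
        }
    }

    /** Splits on whitespace; offsets are relative to the string passed in. */
    static List<Tok> tokenize(String s) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            // Skip whitespace between tokens.
            while (i < s.length() && Character.isWhitespace(s.charAt(i))) i++;
            int start = i;
            // Consume one token.
            while (i < s.length() && !Character.isWhitespace(s.charAt(i))) i++;
            if (i > start) out.add(new Tok(s.substring(start, i), start, i));
        }
        return out;
    }

    public static void main(String[] args) {
        // Offsets relative to the whole text.
        System.out.println(tokenize("this is a book and HERE IS THE TITLE."));
        // Offsets relative to the title alone: they restart at 0.
        System.out.println(tokenize("HERE IS THE TITLE."));
    }
}
```

If only the title value reaches the analyzer, offsets that start at 0 are what this sketch would produce too, which matches the behaviour described in the original question.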
