lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Offset Questions
Date Fri, 07 Mar 2008 20:44:10 GMT
What is your analyzer doing? Let's assume you're trying
to index the title and that your entire text is

"this is a book and HERE IS THE TITLE."

I *think* your underlying analyzer should be returning
4 tokens with starts of 20 for HERE, 25 for IS,
28 for THE and 32 for TITTLE, with appropriate  ends.
Is that what's happening? And perhaps

If the value you're passing in to the analyzer is just the
title and not the entire text, what you report seems
perfectly reasonable to me....

But I haven't worked with this very much so take
this with the appropriate grain of salt...

Best
Erick


On Fri, Mar 7, 2008 at 1:38 PM, Steve Suppe <ssuppe@llnl.gov> wrote:

> Hi all,
>
> I'm trying to index documents so that a) I have all the documents indexed
> 'normally' (in that I can search for documents that match certain words,
> and b) parts of the document that I consider important, such as author and
> title are ALSO stored in their own indexed fields.
>
> I have (a) working fine, and (b) is almost working - however, I'm trying
> to
> force the separate field to have the original offsets of where it existed
> in the text.  As in, if the title was at characters 76-200 in the original
> text, I'd like the field to have that as its information, so when I look
> at
> the field I can find the place in the document quickly.
>
> I don't seem to be able to do this - I have my own analyzer that finds the
> tokens and sets the start and end offsets accordingly.  However, when I
> create the new field and write it to the index, it seems like these
> offsets
> are ignored?  When I pull offsets out later, they start at 0 and move up
> from there.
>
> I am creating the field like:
>
> CASAnnotationAnalyzer psa = new CASAnnotationAnalyzer();
> analyzer.addAnalyzer(info.indexName, psa);
>
> TokenStream ts = psa.tokenStream(info.indexName,
>                                              new StringReader(info.value
> ));
> Field stemF = new Field(info.indexName, ts,
>
> Field.TermVector.WITH_POSITIONS_OFFSETS);
> d.add(stemF);
>
> (d is the document being indexed).
>
> I have tried various permutations of creating the field and token stream -
> does anyone have any insights, please?
>
> Thanks in advance,
> Steve
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message