lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Offset Questions (Follow-Up)
Date Fri, 07 Mar 2008 20:47:44 GMT
Our mails are crossing....

Not that I know of. But why don't you just index (or maybe just store)
a separate field containing your offset information? Something like
title_offset with, say, a comma-separated pair denoting char position
and length that you then read in at search time and parse.
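
A minimal sketch of that stored-offset idea, in plain Java with no Lucene
calls. The class name OffsetPair and the exact "start,length" format are
assumptions made for illustration; the encoded string is what would be put
in a stored field such as title_offset at index time and parsed again after
the document is retrieved.

```java
// Hypothetical helper (not part of Lucene): holds a character start
// position and a length, and converts to and from "start,length".
final class OffsetPair {
    final int start;
    final int length;

    OffsetPair(int start, int length) {
        if (start < 0 || length < 0) {
            throw new IllegalArgumentException("start and length must be >= 0");
        }
        this.start = start;
        this.length = length;
    }

    // End position (exclusive) in the original text.
    int end() {
        return start + length;
    }

    // Value to store in the separate offset field at index time.
    String encode() {
        return start + "," + length;
    }

    // Parse the stored value read back at search time.
    static OffsetPair parse(String stored) {
        int comma = stored.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("expected 'start,length': " + stored);
        }
        int s = Integer.parseInt(stored.substring(0, comma).trim());
        int len = Integer.parseInt(stored.substring(comma + 1).trim());
        return new OffsetPair(s, len);
    }

    public static void main(String[] args) {
        // A title occupying characters 76-200 of the original text.
        OffsetPair title = new OffsetPair(76, 200 - 76);
        String stored = title.encode();
        OffsetPair back = OffsetPair.parse(stored);
        System.out.println(stored + " -> " + back.start + ".." + back.end());
    }
}
```

Because the pair lives in its own stored field, it is unaffected by however
the analyzer splits the title or author text into tokens.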

But your tokenizer controls *everything*. Why isn't the Token returned
from your next() method constructed with the offsets you want?
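
A rough sketch of what that could look like, assuming the field value was
cut out of the original text at a known base offset. It is plain Java, not
Lucene code: Tok is a hypothetical stand-in for Lucene's Token, and
ShiftedTokenizer and baseOffset are names invented for this example. The
point is only that each token's start and end are shifted by the base
offset before the token is handed back.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical whitespace tokenizer that reports offsets relative to the
// original document rather than to the extracted field value.
final class ShiftedTokenizer {

    // Stand-in for a token carrying its text and character offsets.
    static final class Tok {
        final String text;
        final int start; // inclusive, in the original document
        final int end;   // exclusive, in the original document

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Splits value on whitespace and adds baseOffset to every offset.
    static List<Tok> tokenize(String value, int baseOffset) {
        List<Tok> out = new ArrayList<Tok>();
        int n = value.length();
        int i = 0;
        while (i < n) {
            // Skip whitespace between tokens.
            while (i < n && Character.isWhitespace(value.charAt(i))) {
                i++;
            }
            int begin = i;
            // Consume one run of non-whitespace characters.
            while (i < n && !Character.isWhitespace(value.charAt(i))) {
                i++;
            }
            if (i > begin) {
                out.add(new Tok(value.substring(begin, i),
                                baseOffset + begin, baseOffset + i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Author name assumed to begin at character 76 of the document.
        for (Tok t : tokenize("Steve Suppe", 76)) {
            System.out.println(t.text + " [" + t.start + "," + t.end + ")");
        }
    }
}
```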

Erick

On Fri, Mar 7, 2008 at 2:39 PM, Steve Suppe <ssuppe@llnl.gov> wrote:

> OK, I think I understand what's going on - it looks like I am able to set
> the token for the full author name (say, "Steve Suppe") with the correct
> offsets, but the analyzer takes it one step further and tokenizes 'Steve'
> and 'Suppe', which gives me a lot more generated offsets and is
> confusing me.
>
> I like the tokenization, as it allows me to just search for Suppe and get
> results.  However, I don't want those "sub-offsets" returned.  Is there a
> way to distinguish the 'main' offsets for the whole field?
>
> Thanks again,
> Steve
>
> At 10:38 AM 3/7/2008, you wrote:
> >Hi all,
> >
> >I'm trying to index documents so that a) I have all the documents indexed
> >'normally' (in that I can search for documents that match certain words),
> >and b) parts of the document that I consider important, such as author and
> >title, are ALSO stored in their own indexed fields.
> >
> >I have (a) working fine, and (b) is almost working - however, I'm trying
> >to force the separate field to have the original offsets of where it
> >existed in the text.  As in, if the title was at characters 76-200 in the
> >original text, I'd like the field to have that as its information, so when
> >I look at the field I can find the place in the document quickly.
> >
> >I don't seem to be able to do this - I have my own analyzer that finds the
> >tokens and sets the start and end offsets accordingly.  However, when I
> >create the new field and write it to the index, it seems like these
> >offsets are ignored?  When I pull offsets out later, they start at 0 and
> >move up from there.
> >
> >I am creating the field like:
> >
> >CASAnnotationAnalyzer psa = new CASAnnotationAnalyzer();
> >analyzer.addAnalyzer(info.indexName, psa);
> >
> >TokenStream ts = psa.tokenStream(info.indexName,
> >                                 new StringReader(info.value));
> >Field stemF = new Field(info.indexName, ts,
> >                        Field.TermVector.WITH_POSITIONS_OFFSETS);
> >d.add(stemF);
> >
> >(d is the document being indexed).
> >
> >I have tried various permutations of creating the field and token stream -
> >does anyone have any insights, please?
> >
> >Thanks in advance,
> >Steve
> >
> >---------------------------------------------------------------------
> >To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >For additional commands, e-mail: java-user-help@lucene.apache.org
>
