lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: Redundant fields Token class?
Date Sat, 14 Nov 2009 04:01:23 GMT
Babak, if your filter splits a token into two output tokens,
I think you want to adjust the end offset of the first output token
and the start offset of the second.

For a fairly simple example of this, you can look at the
ThaiWordFilter in the Lucene contrib-analyzers package.

It has to break input tokens into subtokens and correct the offsets. It
sounds like you are on the right track, though.
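
[Editor's sketch] The offset adjustment described above can be illustrated in plain Java. This does not use Lucene classes: `Tok` and `splitAt` are hypothetical names invented for this sketch, and it assumes the term text has not been altered upstream, so that positions in the term map one-to-one onto offsets in the original text.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch; Tok and splitAt are hypothetical names, not Lucene API.
final class SplitOffsets {

    /** A token: term text plus [start, end) offsets into the original text. */
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return term + "[" + start + "," + end + ")";
        }
    }

    /**
     * Splits a token after `cut` characters of its term.
     * Assumes the term was not modified upstream, so term positions
     * correspond directly to offsets in the original text.
     */
    static List<Tok> splitAt(Tok in, int cut) {
        if (cut <= 0 || cut >= in.term.length()) {
            throw new IllegalArgumentException("cut must fall inside the term");
        }
        int boundary = in.start + cut;
        List<Tok> out = new ArrayList<>();
        // First subtoken keeps the original start; its end offset moves in.
        out.add(new Tok(in.term.substring(0, cut), in.start, boundary));
        // Second subtoken keeps the original end; its start offset moves up.
        out.add(new Tok(in.term.substring(cut), boundary, in.end));
        return out;
    }

    public static void main(String[] args) {
        // "foobar" occupying offsets 10..16 of some original text.
        for (Tok t : splitAt(new Tok("foobar", 10, 16), 3)) {
            System.out.println(t);
        }
    }
}
```

With offsets set this way, each subtoken would highlight only its own part of the original text. A real filter would carry the same arithmetic over to whatever offset API the Lucene version in use provides.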

On Fri, Nov 13, 2009 at 10:30 PM, Babak Farhang <farhang@gmail.com> wrote:

> Thanks for your explanations. I think I have a basic understanding now.
>
> What I'm not so sure about, now, is how to decide on the start and
> ending offsets when the TokenFilter implementation wants to break an
> input token into subtokens. Should the offsets of the emitted
> subtokens be the same as the original input token?  Should I only have
> highlighting in mind when setting these offsets, or are there other
> things to consider (e.g. impact on search)?
>
> I'll check out some of the contrib filters and Solr's
> WordDelimiterFilter to see how they handle this. But if you know any
> rules of thumb I should follow, please share.
>
> -Babak
>
> PS Hope this kind of follow-up question is not considered bad etiquette.
>
> On Fri, Nov 13, 2009 at 4:20 PM, Robert Muir <rcmuir@gmail.com> wrote:
> > Another example: if you use a stemmer, it might change the termLength
> > (walking -> walk), but the offsets of the original unstemmed word (walking)
> > stay the same.
> >
> > On Fri, Nov 13, 2009 at 6:01 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
> >
> >> This is not coupled because:
> >>
> >> termLength() is the number of chars in the term buffer, whereas the
> >> offsets give the positions in the original char stream. If you use a
> >> CharFilter to, e.g., remove chars, the termLength will get shorter, but
> >> the offsets are still the original ones. Also, the two are indexed in
> >> different ways; the termLength and offsets have no relation and (as said
> >> before) need not even follow a contract like end-start=length.
> >>
> >> -----
> >> Uwe Schindler
> >> H.-H.-Meier-Allee 63, D-28213 Bremen
> >> http://www.thetaphi.de
> >> eMail: uwe@thetaphi.de
> >>
> >> > -----Original Message-----
> >> > From: Babak Farhang [mailto:farhang@gmail.com]
> >> > Sent: Friday, November 13, 2009 11:50 PM
> >> > To: java-user@lucene.apache.org
> >> > Subject: Redundant fields Token class?
> >> >
> >> > I'm writing a TokenFilter and am confused about why class Token has
> >> > both an *endOffset* and a *termLength* field.  It would appear that
> >> > the following invariant should always hold for a Token instance:
> >> >
> >> >     termLength() == endOffset() - startOffset()
> >> >
> >> > If so, then
> >> >
> >> > 1) Why 2 fields, instead of 1?
> >> > 2) Why isn't the invariant enforced in the class?
> >> >
> >> > -Babak
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
> > --
> > Robert Muir
> > rcmuir@gmail.com
> >


-- 
Robert Muir
rcmuir@gmail.com
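
[Editor's sketch] The stemmer example quoted in the thread (walking -> walk) shows why termLength() == endOffset() - startOffset() is not an invariant. The plain-Java sketch below does not use Lucene classes: `Tok`, `stripIng`, and the sample sentence are hypothetical, and the "stemmer" is a toy that only drops a trailing "ing".

```java
// Plain-Java sketch; StemOffsets, Tok and stripIng are hypothetical, not Lucene API.
final class StemOffsets {

    /** A token: term text plus [start, end) offsets into the original text. */
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }

        /** Number of chars in the (possibly rewritten) term. */
        int termLength() {
            return term.length();
        }
    }

    /** Toy "stemmer": drops a trailing "ing" and leaves the offsets untouched. */
    static Tok stripIng(Tok in) {
        if (in.term.length() > 3 && in.term.endsWith("ing")) {
            String stem = in.term.substring(0, in.term.length() - 3);
            return new Tok(stem, in.start, in.end);
        }
        return in;
    }

    public static void main(String[] args) {
        String text = "he was walking home";
        int start = text.indexOf("walking");
        int end = start + "walking".length();

        Tok stemmed = stripIng(new Tok("walking", start, end));

        // The term shrank, but the offsets still cover the original word.
        System.out.println("term       = " + stemmed.term);
        System.out.println("termLength = " + stemmed.termLength());
        System.out.println("end-start  = " + (stemmed.end - stemmed.start));
        System.out.println("original   = " + text.substring(stemmed.start, stemmed.end));
    }
}
```

Here the term length is 4 while end minus start is 7, and the offsets still select "walking" in the source text, which is what a highlighter needs.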
