lucene-java-user mailing list archives

From Babak Farhang <farh...@gmail.com>
Subject Re: Redundant fields Token class?
Date Sat, 14 Nov 2009 03:30:23 GMT
Thanks for your explanations. I think I have a basic understanding now.

What I'm not so sure about, now, is how to decide on the start and
ending offsets when the TokenFilter implementation wants to break an
input token into subtokens. Should the offsets of the emitted
subtokens be the same as the original input token?  Should I only have
highlighting in mind when setting these offsets, or are there other
things to consider (e.g. impact on search)?
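
To make the two options concrete, here is a minimal sketch in plain Java
with no Lucene dependency. SubToken and split() are hypothetical names
invented for illustration, not Lucene APIs, and the narrowing variant
assumes the term text maps one-to-one onto the original characters
(endOffset - startOffset == term length), which is exactly the relation
that does not always hold.

```java
import java.util.ArrayList;
import java.util.List;

class SubtokenOffsets {

    /** Hypothetical stand-in for a token: term text plus offsets into the original text. */
    static final class SubToken {
        final String term;
        final int startOffset;
        final int endOffset;

        SubToken(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return term + "[" + startOffset + "," + endOffset + ")";
        }
    }

    /**
     * Splits a term on a delimiter character.
     * narrowOffsets == false: every subtoken inherits the offsets of the input token.
     * narrowOffsets == true:  every subtoken gets the offsets of its own slice,
     *                         assuming one term char corresponds to one original char.
     */
    static List<SubToken> split(String term, int start, int end,
                                char delimiter, boolean narrowOffsets) {
        List<SubToken> out = new ArrayList<>();
        int partStart = 0;
        for (int i = 0; i <= term.length(); i++) {
            boolean atBoundary = (i == term.length()) || term.charAt(i) == delimiter;
            if (!atBoundary) {
                continue;
            }
            if (i > partStart) { // skip empty parts, e.g. between doubled delimiters
                String part = term.substring(partStart, i);
                if (narrowOffsets) {
                    out.add(new SubToken(part, start + partStart, start + i));
                } else {
                    out.add(new SubToken(part, start, end));
                }
            }
            partStart = i + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        // "wi-fi" found at offsets 10..15 of some original text
        System.out.println(split("wi-fi", 10, 15, '-', false));
        System.out.println(split("wi-fi", 10, 15, '-', true));
    }
}
```

With inherited offsets a highlighter would mark the whole "wi-fi" for a
hit on either part; with narrowed offsets it would mark only the part
that matched.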

I'll check out some of the contrib filters and Solr's
WordDelimiterFilter to see how they handle this. But if you know any
rules of thumb I should follow, please share.

-Babak

PS Hope this kind of follow-up question is not considered bad etiquette.

On Fri, Nov 13, 2009 at 4:20 PM, Robert Muir <rcmuir@gmail.com> wrote:
> Another example: if you use a stemmer, it might change the termLength
> (walking -> walk), but the offsets of the original unstemmed word (walking)
> stay the same.
>
> On Fri, Nov 13, 2009 at 6:01 PM, Uwe Schindler <uwe@thetaphi.de> wrote:
>
>> This is not coupled because:
>>
>> termLength() is the number of chars in the term buffer, whereas the offsets
>> give the positions in the original char stream. If you use a CharFilter to,
>> e.g., remove chars, the termLength will get shorter, but the offsets are
>> still the original ones. Also, the two are indexed in different ways: the
>> termLength and the offsets have no relation and need not (as said before)
>> even follow a contract like end-start=length.
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe@thetaphi.de
>>
>> > -----Original Message-----
>> > From: Babak Farhang [mailto:farhang@gmail.com]
>> > Sent: Friday, November 13, 2009 11:50 PM
>> > To: java-user@lucene.apache.org
>> > Subject: Redundant fields Token class?
>> >
>> > I'm writing a TokenFilter and am confused about why class Token has
>> > both an *endOffset* and a *termLength* field.  It would appear that
>> > the following invariant should always hold for a Token instance:
>> >
>> >     termLength() == endOffset() - startOffset()
>> >
>> > If so, then
>> >
>> > 1) Why 2 fields, instead of 1?
>> > 2) Why isn't the invariant enforced in the class?
>> >
>> > -Babak
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
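
The point made in the quoted replies (term length and offsets are
independent) can be sketched in a few lines of plain Java. This is only
an illustration: SimpleToken and the toy stem() method are hypothetical
stand-ins, not Lucene's Token class or any real stemmer.

```java
class OffsetsVsTermLength {

    /** Hypothetical stand-in for a token: term text plus offsets into the original text. */
    static final class SimpleToken {
        final String term;
        final int startOffset;
        final int endOffset;

        SimpleToken(String term, int startOffset, int endOffset) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        int termLength() {
            return term.length();
        }

        /** Returns a copy with a new term but the same offsets. */
        SimpleToken withTerm(String newTerm) {
            return new SimpleToken(newTerm, startOffset, endOffset);
        }
    }

    /** Toy "stemmer" that only strips a trailing "ing"; not a real stemming algorithm. */
    static SimpleToken stem(SimpleToken t) {
        if (t.term.length() > 3 && t.term.endsWith("ing")) {
            return t.withTerm(t.term.substring(0, t.term.length() - 3));
        }
        return t;
    }

    public static void main(String[] args) {
        String text = "I was walking home";
        int start = text.indexOf("walking");
        SimpleToken original = new SimpleToken("walking", start, start + "walking".length());
        SimpleToken stemmed = stem(original);

        // The term got shorter, but the offsets still cover the unstemmed word.
        System.out.println(stemmed.term + " termLength=" + stemmed.termLength()
                + " offsets=" + stemmed.startOffset + ".." + stemmed.endOffset);
        System.out.println("highlighted span: "
                + text.substring(stemmed.startOffset, stemmed.endOffset));
    }
}
```

After stemming, termLength() is 4 while endOffset - startOffset is
still 7, so the proposed invariant would break even though the token
is perfectly valid.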
