lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Brendan Grainger <brendan.grain...@gmail.com>
Subject Re: Question about startOffset and endOffset
Date Mon, 12 May 2008 21:11:25 GMT
Hi Erick,

Thanks for the reply. The use case I have is this:

Say you have a synonym expansion like this:

ac -> air conditioning

And to keep it simple, a document where the first term is ac. When  
analyzing the document I currently create a token stream that looks  
something like this for the 'ac' term:

ac -> Token(positionIncrement = 1, startOffset = 0, endOffset = 2,  
type = 'word')
air -> Token(positionIncrement = 0, startOffset = 0, endOffset = 2,  
type = 'synonym')
conditioning -> Token(positionIncrement = 0, startOffset = 0,  
endOffset = 2, type = 'synonym')

Now what I'd like to do is be able to do is to display the fact that  
this expansion took place to the user when they query. However, now I  
don't know how to reconstruct the word 'air conditioning'. If however,  
I do this:

ac -> Token(positionIncrement = 1, startOffset = 0, endOffset = 2,  
type = 'word')
air -> Token(positionIncrement = 0, startOffset = 0, endOffset = 3,  
type = 'synonym')
conditioning -> Token(positionIncrement = 0, startOffset = 3,  
endOffset = 15, type = 'synonym')

I can reconstruct the fact that ac maps to air conditioning.

Thanks
Brendan

On May 12, 2008, at 12:23 PM, Erick Erickson wrote:

> Is this a theoretical question or is there a use-case you're trying
> to support? If the latter, a statement of the problem you're trying
> to solve would be helpful.
>
> If the former, setting all your start offsets to 0 seems wrong. You're
> essentially saying that all tokens are at the beginning of the  
> document
> and each one is some ever-increasing distance from the start. Some
> of your later words will look really, really long (like the length
> of your document).
>
> Offhand, I expect this will affect up span queries, phrase
> queries, and who knows what else? Maybe scoring?
> Or maybe not, maybe some (or all) of these work off of
> termpositions rather than offsets. So unless you're trying to
> accomplish something specific, I sure wouldn't do this
> given the problems it *could* create.
>
> That said, I'm not much of an expert on how the offsets are
> used, but I'd be really leery of changing them "just for the fun of
> it" <G>......
>
> Best
> Erick
>
> On Mon, May 12, 2008 at 12:06 PM, Brendan Grainger <
> brendan.grainger@gmail.com> wrote:
>
>> Hi,
>>
>> I have a TokenStream that inserts synonym tokens into the stream when
>> matched. One thing I am wondering about is what is the effect of the
>> startOffset and endOffset. I have something like this:
>>
>> Token synonymToken = new Token(originalToken.startOffset(),
>> originalToken.endOffset(), "SYNONYM");
>> synonymToken.setPositionIncrement(0);
>>
>> What I am wondering is if I set the startOffset and endOffset to 0  
>> and the
>> endOffset to the length of the synonym string what effect will this  
>> have?
>>
>> eg
>> Token synonymToken = new Token(0, repTok.endOffset(), "SYNONYM");
>> synonymToken.setPositionIncrement(0);
>>
>> Thanks
>>
>>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message