lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Karolina Bernat <>
Subject Re: Token position vs. token offset - how to bring them together?
Date Tue, 01 Feb 2011 09:59:29 GMT

perhaps there is someone, who's trying to do the same thing so I just write
down, how I got along with this problem. It is NOT the most elegant
solution, but it works for me. I don't really know yet, how the performance
of my search will be, but the tests look so far ok.

For my phrase search I actually used the SpnQuery - I read that the
QueryParser cant't handle this kind of queries so I do it manually by
checking, if the user entered the search text within the " ".
Handling a SpanQuery one have an access to the query spans - and the spans
give you the start and the end position of the searched words.
Furthermore I could find the positions and the offset information of the
WeightedTerms by using :
QueryTermExtractor.getIdfWeightedTerms(...)  and  TermPositionVector

Because there is no possibility (or none that I know of) of getting the
offset information if you know the terms positions, I thought of saving all
the term positions and the term offset informations.. and since I get the
span start- and end-position from a SpanQuery I can look up in the terms
positions-array, at which index/place in the array I find the position I got
from the SpanQuery and then go to my array with terms offset information and
get the one at the same index/position in this array...
With those informations I can get the start offset of the first term in the
SpanQuery and the end offset of the last term - and I can highlight those

That is really not the best way to process, but I couldn't find any better.

Please let me know, if there is any other (better) way to do it.

On Fri, Jan 28, 2011 at 4:41 PM, Karolina Bernat <> wrote:

> Hello,
> since I moved on with my offset-info problem in HTML files, I got a new one
> trying to bring the tokens positions information together with tokens/term
> offset information. Can someone tell me, how can I get a token, if I know
> its position? It would be nice to get the tokens position from the Token
> class, but I could only get the positionIncrement, which is not really
> helpful..
> What I'm actually trying to do, is to find the offset information of a
> span/phrase query. I know, that the contrib highligter can highlight phrase
> queries, but I want/need to do it one my own (or rather give the information
> to another application, that does the highlighting of my documents). I also
> couldn't really understand, how does the highlighter recognize, that the
> individual tokens/terms belong to the phrase (i.e. if I search for "peter
> pan" at the moment I also get the tokens 'peter' and 'pan' as weighted
> terms, also if they occur individually).
> Thanks so much in advance!
> Karolina

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message