lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <gsing...@apache.org>
Subject Re: Token character positions
Date Wed, 18 Nov 2009 11:24:49 GMT

On Nov 17, 2009, at 10:37 AM, Christopher Tignor wrote:

> Hello,
> 
> Hoping someone might clear up a question for me:
> 
> When Tokenizing we provide the start and end character offsets for each
> token locating it within the source text.
> 
> If I tokenize the text "word" and then serach for the term "word" in the
> same field, how can I recover this character offset information in the
> matching documents to precisely locate the word?  I have been storing this
> character info myself using payload data but if lucene stores it, then I am
> doing so needlessly.  If recovering this character offset info isn't
> possible, what is this charcter offset info used for?

Lucene doesn't, currently, store offset information, so you are not duplicating this.

There are 4 possible ways to store it that I know of, one of which is under construction now
and will eventually be the best solution:

1. Payloads - in fact, there is a TokenFilter in the payloads package under contrib/analysis
that does just this.
2. Term Vectors - Stores lots of info along w/ the offsets.  I've often used this along w/
SpanQueries to get precise locations
3. Hack up the highlighter (assuming you aren't just doing this for highlighting)
4. Flexible indexing (future) - Create your index your way and store the offset info in a
strongly typed payload.  This will require writing your own code.

-Grant
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message