lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doron Cohen <DOR...@il.ibm.com>
Subject Re: get terms by positions
Date Mon, 02 Oct 2006 22:17:04 GMT
You can store TermVectors with position info, but I don't think this would
be enough for what you are asking, because it is not meant for direct
access to a term by its position, and because TermVectors store tokens,
i.e. the "indexed" form of the word, which I am not sure is what you need.

It seems doable by the following:

(1) store the field with the document - see
http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.Store.html#YES

(2) store term vectors with offsets - see
http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Field.TermVector.html#WITH_OFFSETS

(3) access the TermVector of a document by docid - see
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html#getTermFreqVector(int,
 java.lang.String) and
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/TermPositionVector.html

(4) Use the offset info to extract the relevant part of the original text
from the field, iterating backwards and forwards for a whitespace or so -
see
http://lucene.apache.org/java/docs/api/org/apache/lucene/index/IndexReader.html#document(int,
 org.apache.lucene.document.FieldSelector)

All this seems a lot of work, and I am also not sure about the result
performance - the index size would grow due to both TermVectors and the
stored field. If the field content for each doc is also stored in another
store (say, db) then this is less of a concern, but still, lots of work,
and measureable extra IO (in addition to search) if this is part of a
search application.

Perhaps you can expand on the context of this? How is this feature going to
be used?

"Renzo Scheffer" <renzo.scheffer@gmx.de> wrote on 02/10/2006 14:06:40:
> Hi,
>
>
>
> can anybody be so kind to tell me if it is possible to search a Term by
its
> position?
>
>
>
> I search a term (for excample "soccer") and get back the DocId's and
> positions as follows:
>
>
>
>
>
> TermPositions termPos = reader.termPositions(new
Term("contents","soccer"));
>
> while(termPos.next()){
>
> int freq = termPos.freq();
>
> for(int i=0; i<freq; i++){
>
>
>
>       int docNumber = termPos.doc();
>
>       int position = termPos.nextPosition();
>
> System.out.println("DocId: "+docNumber+"; Pos:"+position);
>
> }
>
>
>
>
>
>
>
> Output:
>
>
>
> DocId: 0; Pos: 1
>
> DocId: 0; Pos: 4
>
> DocId: 0; Pos: 7
>
> DocId: 1; Pos: 3
>
> DocId: 1; Pos: 7
>
>
>
> Now I try to get back terms, one position before/after "soccer". I
> considered to take the
>
> Position and increase or decrease it. But I can't find a way to get back
a
> term, according to the given Position.
>
> Can anybody help me?
>
>
>
> Thanks, Renzo


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message