lucene-solr-user mailing list archives

From Jefferson French <jkfaus...@gmail.com>
Subject Unable to get offsets using AtomicReader.termPositionsEnum(Term)
Date Fri, 07 Mar 2014 19:37:02 GMT
We have an API on top of Lucene 4.6 that I'm trying to adapt to run
under Solr 4.6. The problem is that, although I get the correct offsets
when the index is created by Lucene, the same method calls always return -1
when the index is created by Solr. In the latter case I can see the
character offsets in Luke, and I can even get them from Solr through
the /tvrh search handler, which uses the TermVectorComponent class.

This is roughly how I'm reading character offsets in my Lucene code:

> AtomicReader reader = ...
> Term term = ...
> DocsAndPositionsEnum postings = reader.termPositionsEnum(term);
> while (postings.nextDoc() != DocsAndPositionsEnum.NO_MORE_DOCS) {
>   for (int i = 0; i < postings.freq(); i++) {
>     postings.nextPosition(); // offsets are only defined after advancing to a position
>     System.out.println("start:" + postings.startOffset());
>     System.out.println("end:" + postings.endOffset());
>   }
> }


Notice that I want the values for a single term. When run against an index
created by Solr, the above calls to startOffset() and endOffset() return
-1. Solr's TermVectorComponent prints the correct offsets like this
(paraphrased):

> IndexReader reader = searcher.getIndexReader();
> Terms vector = reader.getTermVector(docId, field);
> TermsEnum termsEnum = vector.iterator(null);
> BytesRef text;
> DocsAndPositionsEnum dpEnum = null;
> while ((text = termsEnum.next()) != null) {
>   // within a term vector, totalTermFreq() is the term's frequency in this document
>   int freq = (int) termsEnum.totalTermFreq();
>   String term = text.utf8ToString();
>   dpEnum = termsEnum.docsAndPositions(null, dpEnum);
>   dpEnum.nextDoc();
>   for (int i = 0; i < freq; i++) {
>     final int pos = dpEnum.nextPosition();
>     System.out.println("start:" + dpEnum.startOffset());
>     System.out.println("end:" + dpEnum.endOffset());
>   }
> }


But in this case it retrieves the offsets for every term in a single
document, whereas I want the offsets for a single term.

Could anyone tell me:

   1. Why I'm not able to get the offsets using my first example, and/or
   2. A better way to get the offsets for a given term?

Thanks.

       Jeff
