lucene-java-user mailing list archives

From Barry Coughlan <b.coughl...@gmail.com>
Subject Re: Iterating TermsEnum for Long field produces zero values at the end
Date Mon, 17 Nov 2014 19:55:02 GMT
Makes sense, thanks. I switched the implementation to a FieldCache with no
noticeable performance difference:

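private Longs cacheDocIds() throws IOException {
    // FieldCache takes an AtomicReader, so wrap the top-level reader as one.
    AtomicReader wrapped = SlowCompositeReaderWrapper.wrap(reader);
    // Third argument (setDocsWithField) is false: "has a value" bits are not needed here.
    return FieldCache.DEFAULT.getLongs(wrapped, "id", false);
}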

Regards,
Barry

On Mon, Nov 17, 2014 at 6:50 PM, Uwe Schindler <uwe@thetaphi.de> wrote:

> Hi,
>
> > It is expected: those are the "prefix" terms, which come after all the
> > full-precision numeric terms.
> >
> > But I'm not sure why you see 0s ... the bytes should be unique for every
> > term you get back from the TermsEnum.
>
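A rough way to picture the term order described above is the following plain-Java model. It does not call Lucene; it only imitates the idea that each value is indexed once at full precision and then again with low bits cleared. The class name `PrefixTermModel`, the method `distinctTerms`, and the precision step of 16 (shifts 0, 16, 32, 48) are assumptions made for illustration, not something confirmed in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class PrefixTermModel {
    // Assumption: precision step 16, so a 64-bit value is indexed at shifts 0, 16, 32, 48.
    static final int PRECISION_STEP = 16;

    /**
     * Returns the distinct model terms as {shift, value with the low "shift" bits cleared},
     * ordered by shift first and by value within each shift level.
     */
    static List<long[]> distinctTerms(long[] ids) {
        List<long[]> terms = new ArrayList<>();
        for (int shift = 0; shift < 64; shift += PRECISION_STEP) {
            // Values that share a prefix collapse into a single term at this level.
            TreeSet<Long> level = new TreeSet<>();
            for (long id : ids) {
                level.add((id >>> shift) << shift);
            }
            for (long value : level) {
                terms.add(new long[] {shift, value});
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        for (long[] term : distinctTerms(new long[] {1, 2, 3, 4, 5})) {
            System.out.println("shift=" + term[0] + " decoded=" + term[1]);
        }
    }
}
```

Under these assumptions the ids 1 to 5 yield five full-precision terms followed by one shared term at each of the three coarser levels, which would be consistent with seeing exactly three extra entries.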
> That's easy to explain:
>
> The lower-precision terms at the end each have more than one doc in their
> DocsEnum, but you only ever read the first one (Lucene docid 0) and never
> iterate over the remaining entries. Each of these prefix-coded terms has a
> shift value > 0, and because bits are stripped from the right, small long
> values decode to 0L.
>
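The effect of reading only the first document of each term can be reproduced with a small stand-alone model. This is a sketch in plain Java, not Lucene code: the class `FirstDocOnlyDemo`, the method `buildLookup`, and the precision step of 16 are hypothetical choices for illustration, and the `TreeMap` merely stands in for a term dictionary with postings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FirstDocOnlyDemo {
    // Assumption: precision step 16, i.e. shifts 0, 16, 32 and 48 for a long.
    static final int PRECISION_STEP = 16;

    /**
     * Imitates the loop from the original post on a model index: for every term,
     * only the first matching document is recorded. With fullPrecisionOnly set,
     * the coarser levels (shift > 0) are skipped.
     */
    static long[] buildLookup(long[] idsByDoc, boolean fullPrecisionOnly) {
        long[] lookup = new long[idsByDoc.length];
        for (int shift = 0; shift < 64; shift += PRECISION_STEP) {
            if (fullPrecisionOnly && shift > 0) {
                break;
            }
            // Model postings: decoded term value -> documents containing it, ascending.
            Map<Long, List<Integer>> postings = new TreeMap<>();
            for (int doc = 0; doc < idsByDoc.length; doc++) {
                long term = (idsByDoc[doc] >>> shift) << shift;
                postings.computeIfAbsent(term, k -> new ArrayList<>()).add(doc);
            }
            for (Map.Entry<Long, List<Integer>> e : postings.entrySet()) {
                int firstDoc = e.getValue().get(0); // like a single nextDoc() call
                System.out.println(firstDoc + " " + e.getKey());
                lookup[firstDoc] = e.getKey();
            }
        }
        return lookup;
    }

    public static void main(String[] args) {
        long[] flawed = buildLookup(new long[] {1, 2, 3, 4, 5}, false);
        System.out.println("slot 0 after visiting all terms: " + flawed[0]);
    }
}
```

In this model the coarser terms do more than print extra lines: each one writes 0 into the slot for document 0, so the value stored for that document ends up wrong. Skipping the coarser levels avoids that in the model, but in a real index it would still depend on how numeric terms are encoded.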
> In general, for this kind of cache I would not use terms; I would use
> numeric docvalues instead. An alternative is to use FieldCache, which does
> the right thing automatically. Relying on the internal implementation of
> numeric terms is not a good idea.
>
> Uwe
>
> > On Mon, Nov 17, 2014 at 10:39 AM, Barry Coughlan
> > <b.coughlan2@gmail.com> wrote:
> > > Hi all,
> > >
> > > I'm using 4.10.2. I have a Long "id" field. Each document has one "id"
> > > value. I am creating a look-up between Lucene's internal document id
> > > and my "id" values by enumerating the inverted index:
> > >
> > >     private long[] cacheDocIds() throws IOException {
> > >         long[] ourIds = new long[reader.maxDoc()];
> > >
> > >         Bits liveDocs = MultiFields.getLiveDocs(reader);
> > >         Fields fields = MultiFields.getFields(reader);
> > >         Terms terms = fields.terms("id");
> > >
> > >         TermsEnum iterator = terms.iterator(null);
> > >         BytesRef bytesRef = null;
> > >         while ((bytesRef = iterator.next()) != null) {
> > > >             DocsEnum docsEnum = iterator.docs(liveDocs, null,
> > > >                     DocsEnum.FLAG_NONE);
> > >
> > >             int luceneId = docsEnum.nextDoc();
> > >             long ourId = NumericUtils.prefixCodedToLong(bytesRef);
> > >             System.out.println(luceneId + " " + ourId);
> > >             ourIds[luceneId] = ourId;
> > >         }
> > >
> > >         return ourIds;
> > >     }
> > >
> > > With 5 documents (1, 2, 3, 4, 5) I get this output from the above code:
> > >
> > > 0 1
> > > 1 2
> > > 2 3
> > > 3 4
> > > 4 5
> > > 0 0
> > > 0 0
> > > 0 0
> > >
> > > I don't understand why there are three zeroes at the end.
> > >
> > > - reader.maxDoc is 5 and no documents have been deleted.
> > > - I have tried this with a varying number of documents and there are
> > > always three zeroes at the end.
> > > - I tried changing version to Lucene 4.10.0 and Lucene 4.9 and the
> > > same behavior occurs.
> > >
> > > > I can work around this, but I'm just curious whether this behavior
> > > > is expected.
> > >
> > > Regards,
> > > Barry
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
>
