lucene-java-user mailing list archives

From Ahmet Arslan <>
Subject Re: getting number of terms in a document/field
Date Fri, 06 Feb 2015 13:51:59 GMT
Hi Michael,

Thanks for the explanation. I am working with a TREC dataset; 
since it is static, I set the size of that array experimentally. 

I followed the DefaultSimilarity#lengthNorm method a bit.

If the default similarity is used with no index-time boost, 
I assume that the norm equals 1.0 / Math.sqrt(numTerms).

The first option is to somehow obtain the pre-computed norm value and invert it to obtain
numTerms = (1/norm)^2. This will be an approximation because norms are stored in a byte.
How do I access that norm value for a given docid and field?
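
To make the inversion above concrete, here is a minimal, Lucene-free sketch of the arithmetic only. It assumes, as stated in this message, that the norm is 1.0 / Math.sqrt(numTerms). The class and method names are hypothetical, and `coarsen` is only an illustrative stand-in for lossy storage (it keeps 3 mantissa bits of the float); it is not Lucene's actual byte encoding.

```java
// Hypothetical helper, not part of Lucene: illustrates numTerms = (1/norm)^2.
class NormEstimate {
    // Length norm under the assumption in this thread:
    // default similarity, no index-time boost.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Inverse operation: recover an estimate of numTerms from a norm.
    static long estimateNumTerms(float norm) {
        double inv = 1.0 / norm;
        return Math.round(inv * inv);
    }

    // Illustrative stand-in for lossy storage: keep sign, exponent and only
    // the top 3 mantissa bits. This is NOT Lucene's real norm encoding.
    static float coarsen(float f) {
        return Float.intBitsToFloat(Float.floatToRawIntBits(f) & 0xFFF00000);
    }

    public static void main(String[] args) {
        float exact = lengthNorm(100);
        // With the full float the round trip recovers the length.
        System.out.println(estimateNumTerms(exact));          // prints 100
        // After coarse quantization the estimate drifts.
        System.out.println(estimateNumTerms(coarsen(exact))); // prints 114
    }
}
```

With the full float the round trip is exact for these inputs; once precision is thrown away the recovered count is only a rough estimate, which is the trade-off of the first option.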

The second option is to store numTerms as a separate field, like any other ordinary field.
Do I need to calculate it myself, or can I access the numTerms value that is already computed
during indexing? 

I think I will follow the second option.
Is there a pointer to an example that demonstrates reading/writing a DocValues-based field?


On Friday, February 6, 2015 11:08 AM, Michael McCandless <> wrote:
How will you know how large to allocate that array?  The within-doc
term freq can in general be arbitrarily large...
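
One way around the sizing question above is to count in a map keyed by frequency instead of a fixed-size array. The following is a minimal sketch under that assumption; it is independent of Lucene, the class name `FreqHistogram` is hypothetical, and it needs Java 8 for `Map.merge` and `Map.getOrDefault`.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: histogram of within-doc term frequencies
// that does not need an upper bound on the frequency.
class FreqHistogram {
    // Sorted by frequency so the histogram prints in ascending order.
    private final Map<Integer, Integer> counts = new TreeMap<>();

    // Record one document whose within-doc frequency is `freq`.
    void record(int freq) {
        counts.merge(freq, 1, Integer::sum);
    }

    // Number of documents seen with exactly this frequency.
    int count(int freq) {
        return counts.getOrDefault(freq, 0);
    }

    public static void main(String[] args) {
        FreqHistogram h = new FreqHistogram();
        // Made-up sample frequencies, including one very large value.
        int[] sampleFreqs = {1, 1, 3, 250000, 3, 1};
        for (int f : sampleFreqs) {
            h.record(f);
        }
        System.out.println(h.counts); // prints {1=3, 3=2, 250000=1}
    }
}
```

In the traversal loop quoted below, `h.record(docsEnum.freq())` would take the place of `array[docsEnum.freq()]++`.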

Lucene does not directly store the total number of terms in a
document, but it does store it approximately in the doc's norm value.
Maybe you can use that?  Alternatively, you can store this statistic
yourself, e.g. as a doc value.

Mike McCandless

On Thu, Feb 5, 2015 at 7:24 PM, Ahmet Arslan <> wrote:
> Hello Lucene Users,
> I am traversing all documents that contain a given term with the following code:
> Term term = new Term(field, word);
> Bits bits = MultiFields.getLiveDocs(reader);
> DocsEnum docsEnum = MultiFields.getTermDocsEnum(reader, bits, field, term.bytes());
> while (docsEnum.nextDoc() != DocsEnum.NO_MORE_DOCS) {
>     array[docsEnum.freq()]++;
>     // how to retrieve term count for this document?
>     xxxxx(docsEnum.docID(), field);
> }
> How can I get the field term count for these documents using Lucene 4.10.3?
> Is the above code OK for traversing the postings list of a term?
> Thanks,
> Ahmet
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail: