lucene-java-user mailing list archives

From Uwe Schindler <>
Subject Re: get frequency of each term from a document
Date Sun, 20 Sep 2015 15:41:38 GMT

For a term vectors enum, the doc freq is always 1, and the term freq is the one from the
document whose term vectors you retrieved.

Term vectors implement the same interface as the main index, but they can be seen as a small
index per document. They are designed this way so that queries can be executed against a
single document for highlighting.
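
As an illustration of the point above, here is a minimal sketch of reading per-document term frequencies from a term vector. It assumes the Lucene 5.3 API mentioned in this thread (other versions may differ) and a field that was indexed with term vectors enabled. The class name TermVectorFrequencies and the method termFrequencies are hypothetical names chosen for this example. Because a term vector behaves like a one-document index, totalTermFreq() on its TermsEnum gives the number of occurrences within that document.

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

// Hypothetical helper, not part of Lucene.
final class TermVectorFrequencies {

    /**
     * Returns term -> frequency within the given document for one field.
     * The map is empty if the document has no term vector for that field.
     */
    static Map<String, Long> termFrequencies(IndexReader reader, int docId, String field)
            throws IOException {
        Map<String, Long> frequencies = new LinkedHashMap<>();

        // Null when term vectors were not stored for this document/field.
        Terms vector = reader.getTermVector(docId, field);
        if (vector == null) {
            return frequencies;
        }

        // Assumption: Lucene 5.3, where iterator() takes no arguments.
        TermsEnum termsEnum = vector.iterator();
        BytesRef term;
        while ((term = termsEnum.next()) != null) {
            // The term vector is a single-document index, so the "total"
            // term frequency is the frequency within this document only.
            frequencies.put(term.utf8ToString(), termsEnum.totalTermFreq());
        }
        return frequencies;
    }
}
```

A call such as TermVectorFrequencies.termFrequencies(reader, docId, "body") would then return the terms of that one document with their counts. The field name "body" is only a placeholder, and the field must have been indexed with term vectors enabled (for example via FieldType.setStoreTermVectors(true)).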


On 20 September 2015 16:28:12 CEST, Ziqi Zhang <> wrote:
>Thanks, but TermsEnum has two methods that return frequency-related
>info, and both are corpus-level, not document-specific:
>-docFreq() Returns the number of documents containing the current term.
>-totalTermFreq() Returns the total number of occurrences of this term
>across all documents (the sum of the freq() for each doc that has this
>term).
>However, I need the document-specific frequency, i.e., the freq of term A in
>Doc 1, 2, ... N
>On 20/09/2015 15:07, Uwe Schindler wrote:
>> Hi,
>> With the terms enum you can iterate over all terms. Each one returns
>> its term frequency. Of course, you need to enable term vectors during
>> indexing. The pattern for using the terms enum can be looked up in various
>> places in the Lucene source code. It's a very expert API, but it is the way
>> to go here.
>> Uwe
>> On 20 September 2015 15:35:40 CEST, Ziqi Zhang wrote:
>>> Hi
>>> Is it possible to get a list of terms within a document, and also the TF of
>>> each of these terms *in that document only*? (Lucene 5.3)
>>> IndexReader has a method "Terms getTermVector(int docID, String
>>> field)",
>>> which gives me a "Terms" object, on which I can get a TermsEnum. But I
>>> do not know where to go from there.
>>> thanks
>> --
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, 28213 Bremen
>Ziqi Zhang
>Research Associate
>Department of Computer Science
>University of Sheffield

Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen