lucene-java-user mailing list archives

From Hany Azzam
Subject Re: IndexDocValues and storing Stats
Date Wed, 04 Jan 2012 14:58:36 GMT
Hi Simon,

Thank you for your reply. The document length is just one example of what I need to store.
Another stat that I need is a *normalised* sum of the TFs. I can compute this during retrieval
by extending SimilarityBase and keeping the values in my own cache, which is consulted whenever
the score method is invoked. However, I am trying to push this into the index to make it more
efficient, and as I said earlier, I haven't found a way to do this.

With regard to document length (DL): yes, you are right, but unfortunately Lucene doesn't provide
the raw (real) document length, as far as I know. It only provides the encoded/decoded DL.
I have read on the forum (and seen in my own experiments) that the difference in quality between
a similarity function implemented with the raw DL and the same function implemented with Lucene's
exposed (encoded/decoded) DL is not statistically significant. However, I still prefer to
use the raw DL, which is why I compute it as the sum of the TFs in a document and cache it.
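
To make the caching idea concrete, here is a minimal sketch in plain Java of what I mean by
"raw DL as the sum of the TFs". It deliberately uses no Lucene classes: the class name
RawDocLengthCache, its methods, and the term/frequency map are all hypothetical and only
illustrate the bookkeeping, not how the values would be read from or written to the index.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of Lucene): caches raw document lengths by doc id. */
class RawDocLengthCache {
    private final Map<Integer, Long> lengths = new HashMap<>();

    /** Raw document length, taken here as the sum of all term frequencies in the document. */
    static long rawLength(Map<String, Integer> termFreqs) {
        long sum = 0;
        for (int freq : termFreqs.values()) {
            sum += freq;
        }
        return sum;
    }

    /** Computes and stores the raw length for one document. */
    void put(int docId, Map<String, Integer> termFreqs) {
        lengths.put(docId, rawLength(termFreqs));
    }

    /** Returns the cached raw length, or -1 if the document has not been seen. */
    long get(int docId) {
        Long value = lengths.get(docId);
        return value == null ? -1L : value;
    }

    public static void main(String[] args) {
        // Assumed per-document term frequencies, e.g. collected from term vectors.
        Map<String, Integer> termFreqs = new HashMap<>();
        termFreqs.put("lucene", 3);
        termFreqs.put("index", 2);
        termFreqs.put("docvalues", 1);

        RawDocLengthCache cache = new RawDocLengthCache();
        cache.put(42, termFreqs);
        System.out.println(cache.get(42)); // prints 6
    }
}
```

A score method could then look up cache.get(docId) instead of the decoded norm. The drawback
is the one I described: the cache lives outside the index and has to be rebuilt or kept in sync.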


On 4 Jan 2012, at 14:37, Simon Willnauer wrote:

> Hey,
> On Wed, Jan 4, 2012 at 1:15 PM, Hany Azzam wrote:
>> Hi,
>> I am experimenting with the Lucene trunk (aka 4.0), especially with the new IndexDocValues
>> feature. I am trying to store some query-independent statistics such as PageRank, etc. One
>> stat that I am trying to store is the sum of all the term frequencies in a document. This
>> can be seen as the document length. Is there a way to pre-compute this sum while performing
>> the indexing?
> Lucene is already computing the length of the document in its
> FieldInvertState, which is passed to the similarity, i.e. look at
> Similarity#computeNorm. Currently the norm value is a single byte,
> but I am working on exposing this via DocValues so you can store
> custom data in your similarity.
> simon
>> Thank you,
>> h.
>>> TermVectors are still available in Lucene trunk aka 4.0; we just changed their
>>> implementation to fit the general Lucene Terms/Fields/… APIs. TermVectors (if enabled
>>> in the document during indexing) are simply something like a small index per document, written
>>> to disk like a stored field (this has nothing to do with DocValues, since you mentioned them).
>>> Theoretically, you can execute a query against the small “TermVectors Index” and get exactly
>>> one hit if the query matches this document, or no hit otherwise. This is used, e.g., for
>>> highlighting if TVs are enabled. To support this “TV as a small index”, the old API was removed
>>> and the new TermVectors API returns the same Terms/TermsEnum/DocsEnum APIs as IndexReader does
>>> for a complete index, but all structures simply return one document (ID=0) and the
>>> corresponding term frequencies/doc frequencies.
>>> For example code showing how to use it, review the Lucene test cases, for instance:
>>>     Terms result = reader.getTermVectors(docId).terms(DocHelper.TEXT_FIELD_2_KEY);
>>>     assertNotNull(result);
>>>     assertEquals(3, result.getUniqueTermCount());
>>>     TermsEnum termsEnum = result.iterator(null);
>>>     while (termsEnum.next() != null) {
>>>       String term = termsEnum.term().utf8ToString();
>>>       int freq = (int) termsEnum.totalTermFreq();
>>>       assertTrue(freq > 0);
>>>     }
>>>     Fields results = reader.getTermVectors(docId);
>>>     assertTrue(results != null);
>>>     assertEquals("We do not have 3 term freq vectors", 3, results.getUniqueFieldCount());
>>> Uwe
>>> -----
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen