lucene-java-user mailing list archives

From Simon Willnauer <simon.willna...@googlemail.com>
Subject Re: IndexDocValues and storing Stats
Date Wed, 04 Jan 2012 14:37:31 GMT
Hey,

On Wed, Jan 4, 2012 at 1:15 PM, Hany Azzam <hany@eecs.qmul.ac.uk> wrote:
> Hi,
>
> I am experimenting with the Lucene trunk (aka 4.0), especially with the new IndexDocValues
> feature. I am trying to store some query-independent statistics such as PageRank, etc. One
> stat that I am trying to store is the sum of all the term frequencies in a document. This
> can be seen as the document length. Is there a way to pre-compute this sum while performing
> the indexing?

Lucene already computes the length of the document in its
FieldInvertedState, which is passed to the similarity, i.e. look at
Similarity#computeNorms. Currently the norm value is a single byte,
but I am working on exposing this via DocValues so you can store
custom data in your similarity.
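
To make the quantity in question concrete, here is a minimal sketch in
plain Java of "document length as the sum of all term frequencies". It
does not use any Lucene API: the class DocLengthSketch, its methods, and
the whitespace tokenization are all hypothetical and only illustrate
the arithmetic. During real indexing this count would come from the
analysis chain and FieldInvertedState rather than from code like this.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of Lucene.
public class DocLengthSketch {

    /** Counts how often each lower-cased, whitespace-separated term occurs. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {          // skip the empty token from leading whitespace
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Document length defined as the sum of all term frequencies. */
    static int documentLength(Map<String, Integer> tf) {
        int sum = 0;
        for (int freq : tf.values()) {
            sum += freq;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Integer> tf = termFrequencies("to be or not to be");
        // to=2, be=2, or=1, not=1, so the length is 6
        System.out.println(documentLength(tf)); // prints 6
    }
}
```

Summing the frequencies simply recovers the number of tokens in the
field, which is why a per-field token count kept at indexing time is
enough to answer the question.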

simon
>
> Thank you,
> h.
>
>
>
>> TermVectors are still available in Lucene trunk aka 4.0, we just changed the implementation
>> of them to fit the general Lucene Terms/Fields/… APIs. TermVectors (if enabled in the document
>> during indexing) are simply something like a small index per document written to disk like
>> a stored field (it has nothing to do with DocValues, because you mentioned this). Theoretically,
>> you can execute a query against the small “TermVectors Index” and get exactly one hit
>> or no hit, if the query matches this document. This is e.g. used for highlighting if TV are
>> enabled. To support this “TV as a small index”, the old API was removed and the new TermVectors
>> API returns the same Terms/TermsEnum/DocsEnum APIs like IndexReader for a complete index,
>> but all structures simply return one document (ID=0) and corresponding term frequencies/doc
>> frequencies.
>>
>> For example code showing how to use it, review the Lucene test cases, for instance:
>>
>>     Terms result = reader.getTermVectors(docId).terms(DocHelper.TEXT_FIELD_2_KEY);
>>     assertNotNull(result);
>>     assertEquals(3, result.getUniqueTermCount());
>>     TermsEnum termsEnum = result.iterator(null);
>>     while(termsEnum.next() != null) {
>>       String term = termsEnum.term().utf8ToString();
>>       int freq = (int) termsEnum.totalTermFreq();
>>       assertTrue(freq > 0);
>>     }
>>
>>     Fields results = reader.getTermVectors(docId);
>>     assertTrue(results != null);
>>     assertEquals("We do not have 3 term freq vectors", 3, results.getUniqueFieldCount());
>>
>> Uwe
>>
>> -----
>> Uwe Schindler
>> H.-H.-Meier-Allee 63, D-28213 Bremen
>> http://www.thetaphi.de
>> eMail: uwe@thetaphi.de
>>
