lucene-java-user mailing list archives

From Hany Azzam <h...@eecs.qmul.ac.uk>
Subject Re: IndexDocValues and storing Stats
Date Wed, 04 Jan 2012 12:15:13 GMT
Hi,

I am experimenting with Lucene trunk (aka 4.0), especially with the new IndexDocValues
feature. I am trying to store some query-independent statistics, such as PageRank. One
statistic I would like to store is the sum of all the term frequencies in a document, which
can be seen as the document length. Is there a way to pre-compute this sum during
indexing?
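
To make the quantity concrete, here is a minimal sketch in plain Java of what I mean by the
sum of term frequencies. It uses no Lucene APIs: the class name DocLengthSketch and both
method names are hypothetical, and lowercasing plus whitespace splitting is only an assumed
stand-in for whatever analyzer is used at indexing time.

```java
import java.util.HashMap;
import java.util.Map;

public class DocLengthSketch {

    // Count how often each term occurs in the text.
    // Assumption: lowercasing + whitespace splitting stands in for a real analyzer.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                freqs.merge(token, 1, Integer::sum);
            }
        }
        return freqs;
    }

    // Document length defined as the sum of all term frequencies.
    static long documentLength(String text) {
        long sum = 0;
        for (int tf : termFrequencies(text).values()) {
            sum += tf;
        }
        return sum;
    }

    public static void main(String[] args) {
        // to=2, be=2, or=1, not=1
        System.out.println(documentLength("to be or not to be")); // 6
    }
}
```

Under this definition the sum is simply the number of tokens the analyzer emits for the
document, so it could in principle be accumulated while the tokens are being produced.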

Thank you,
h.



> TermVectors are still available in Lucene trunk (aka 4.0); we just changed their
> implementation to fit the general Lucene Terms/Fields/… APIs. TermVectors (if enabled in
> the document during indexing) are simply something like a small index per document,
> written to disk like a stored field (since you mentioned it: this has nothing to do with
> DocValues). Theoretically, you can execute a query against the small “TermVectors Index”
> and get exactly one hit, or no hit, depending on whether the query matches this document.
> This is used, for example, for highlighting if TVs are enabled. To support this “TV as a
> small index” model, the old API was removed, and the new TermVectors API returns the same
> Terms/TermsEnum/DocsEnum APIs as IndexReader does for a complete index, except that all
> structures simply return one document (ID=0) and the corresponding term frequencies/doc
> frequencies.
>  
> For example code showing how to use it, see the Lucene test cases. For instance:
>  
>     Terms result = reader.getTermVectors(docId).terms(DocHelper.TEXT_FIELD_2_KEY);
>     assertNotNull(result);
>     assertEquals(3, result.getUniqueTermCount());
>     TermsEnum termsEnum = result.iterator(null);
>     while(termsEnum.next() != null) {
>       String term = termsEnum.term().utf8ToString();
>       int freq = (int) termsEnum.totalTermFreq();
>       assertTrue(freq > 0);
>     }
>  
>     Fields results = reader.getTermVectors(docId);
>     assertTrue(results != null);
>     assertEquals("We do not have 3 term freq vectors", 3, results.getUniqueFieldCount());
    
>  
> Uwe
>  
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>  
