Hello There:
I am currently working on an index statistics generator that I'd like to
use for some term-weight tests on a (rather large) Lucene index. In general,
the stats I'm hoping to work with are based on a term's frequency across the
entire indexed document set.
TF-IDF works readily in Lucene's searcher, and a Term's DF (the number of
documents that contain it) is easy to get. However, TF in Lucene seems
limited to a per-document basis. That means that to compute the number of
times a term appears in the whole indexed document set, I would have to
(hypothetically) do the following:
- Given Term t, find TF(t), its total frequency across the index:
- Get the enumeration of t over the index via TermDocs (so I have (doc, freq)
pairings)
- For each (doc, freq) pair, add freq to the total index frequency
So if I have x terms, I would be visiting every (doc, freq) pair of every
term, roughly x * DF(t) steps over the entire index, to find the index
frequency for all terms. Is this the only way of getting this information?
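To make the loop I'm describing concrete, here is a minimal sketch in plain
Java. It does not call Lucene: Posting and IndexFrequencySketch are
hypothetical stand-ins I made up for illustration, and in a real index the
(doc, freq) pairs would come from walking TermDocs for each term.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one (doc, freq) pair; not a Lucene class.
final class Posting {
    final int doc;
    final int freq;

    Posting(int doc, int freq) {
        this.doc = doc;
        this.freq = freq;
    }
}

// Hypothetical helper illustrating the summation; not part of Lucene.
final class IndexFrequencySketch {

    // Sums freq over every (doc, freq) pair of a single term.
    // The loop runs once per document containing the term, i.e. DF(t) times.
    static long totalFrequency(List<Posting> postings) {
        long total = 0;
        for (Posting p : postings) {
            total += p.freq;
        }
        return total;
    }

    // Repeats the summation for every term, so the overall cost is the
    // total number of (doc, freq) pairs in the index.
    static Map<String, Long> totalFrequencies(Map<String, List<Posting>> index) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (Map.Entry<String, List<Posting>> e : index.entrySet()) {
            totals.put(e.getKey(), totalFrequency(e.getValue()));
        }
        return totals;
    }
}
```

The point of the sketch is only the cost: every term's full posting list is
visited once, which is what I would like to avoid on a large index.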
Since my data set (and term set) are quite large, I was trying to find out
whether Lucene has another mechanism for this, either at the indexing or
the searching level. However, I've had little luck sifting through the
information I've found (it mostly points me to TF-IDF) to see whether Lucene
has something I can use to make this process faster.
I have also read a bit about TermVectors, but those seem to be per-document
as well.
If there isn't a method at the search level (that is, once indexing is
complete), I would be willing to accept the overhead of generating these
stats at indexing time, if that would be more efficient...
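By generating the stats at indexing time I mean something like the sketch
below: keep a running count per term while each document's text is being
processed. This is plain Java rather than Lucene code. TermCountAccumulator
is a hypothetical name, and the lowercase-and-split-on-whitespace tokenizer
is only an assumption standing in for whatever analyzer the index really
uses.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical index-time counter; not part of Lucene.
final class TermCountAccumulator {
    private final Map<String, Long> counts = new HashMap<>();

    // Call once per document, alongside adding it to the index.
    // Tokenization here is a naive stand-in for a real analyzer.
    void addDocument(String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1L, Long::sum);
            }
        }
    }

    // Total occurrences of the term across all documents added so far.
    long count(String term) {
        return counts.getOrDefault(term, 0L);
    }
}
```

The counts would only match the index if the tokenization matches the
analyzer's output, and the map would have to be stored somewhere next to
the index so it survives between runs.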
Thanks,
drago