lucene-general mailing list archives

From Ted Dunning
Subject Re: Terms-Across-All-Documents
Date Mon, 10 Aug 2009 05:48:52 GMT
I think a more reasonable approach for experiments like this is to compute
and store the statistics you want as part of the indexing process.
That will give you complete flexibility to do what you need.
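
As a rough sketch of that idea, the statistics can be accumulated in plain
Java alongside whatever code feeds documents to the indexer. Nothing below is
Lucene API: the class name SideStats, its methods, and the tab-separated file
layout are assumptions made for illustration, and the tokens are assumed to be
the same ones the analyzer produces for the index.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: collects per-term statistics
// while documents are being indexed and writes them to a side file.
final class SideStats {
    // term -> total occurrences across all documents
    private final Map<String, Long> totalFreq = new TreeMap<>();
    // term -> number of documents containing the term
    private final Map<String, Integer> docFreq = new TreeMap<>();
    private int docCount = 0;

    /** Call once per document with that document's tokens. */
    void addDocument(List<String> tokens) {
        docCount++;
        for (String t : tokens) {
            totalFreq.merge(t, 1L, Long::sum);
        }
        // Count each distinct term once per document.
        for (String t : new HashSet<>(tokens)) {
            docFreq.merge(t, 1, Integer::sum);
        }
    }

    long totalFreq(String term) { return totalFreq.getOrDefault(term, 0L); }

    int docFreq(String term) { return docFreq.getOrDefault(term, 0); }

    int docCount() { return docCount; }

    /** Writes one "term<TAB>totalFreq<TAB>docFreq" line per term. */
    void save(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Long> e : totalFreq.entrySet()) {
            lines.add(e.getKey() + "\t" + e.getValue() + "\t" + docFreq.get(e.getKey()));
        }
        Files.write(file, lines, StandardCharsets.UTF_8);
    }
}
```

Each document costs one pass over its tokens, so the index-wide totals are
available as soon as indexing finishes, without walking any postings.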

Then at retrieval time you can read that term-level information and pass it
into a custom similarity function.  That keeps a clean separation between
index time and search time, but still gives you any information that you
might like to use.  Keeping your data in a side file, separate from the
index, also lets you avoid those aspects of Lucene that are heavily oriented
around efficiency. They are good in their place, but could make your research
work much more difficult in the exploratory phase.
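
A minimal sketch of the retrieval-time half, again in plain Java rather than
Lucene API. The class name SideFileWeights, the assumed line format
"term<TAB>totalFreq<TAB>docFreq", and the inverse-collection-frequency formula
are all assumptions chosen for illustration; any weight derived from the
stored statistics could be substituted and handed to a custom similarity.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: loads index-wide term totals
// from side-file lines and turns them into a term weight.
final class SideFileWeights {
    private final Map<String, Long> totalFreq = new HashMap<>();
    private long totalTokens = 0;

    /** Parses lines of the assumed form "term<TAB>totalFreq<TAB>docFreq". */
    SideFileWeights(List<String> lines) {
        for (String line : lines) {
            String[] fields = line.split("\t");
            long freq = Long.parseLong(fields[1]);
            totalFreq.put(fields[0], freq);
            totalTokens += freq;
        }
    }

    /**
     * Illustrative weight: log(1 + totalTokens / totalFreq(term)).
     * Terms that are rarer across the whole index get a larger weight.
     */
    double weight(String term) {
        long freq = totalFreq.getOrDefault(term, 0L);
        if (freq == 0) {
            return 0.0; // term never seen at indexing time
        }
        return Math.log(1.0 + (double) totalTokens / freq);
    }
}
```

In practice the lines would come from something like
Files.readAllLines(path), loaded once when the searcher is opened.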

On Sun, Aug 9, 2009 at 8:45 PM, K. M. McCormick wrote:

> Hello There:
> I am currently working on an INDEX STAT GENERATOR I'd like to use for some
> term-weight tests in a (rather large) Lucene Index. In general, the stats
> I'm hoping to work with are based on a term's frequency across the entire
> indexed document set.
> TFIDF easily works in Lucene's searcher - and you can get access to a
> Term's DF (across all documents, obviously) quite easily. However, TF in
> Lucene seems limited to a by-document basis. Meaning, to generate the
> number of times this term has appeared in the indexed document set, I would
> have to (hypothetically) do the following:
> - Given Term t, find TF(t)
> - Get the enumeration of t over the index - TermDocs (so I have doc, freq
> pairings)
> - For each (doc, freq) pair, add freq to the total-index-frequency
> So if I have x terms, I would be iterating through x*TF(t) for the entire
> index to find out the index-frequency for all terms. Is this the only
> method of getting this information?
> Since my data set (and term set) are quite large, I was trying to find if
> there was another mechanism in place for Lucene, either at the indexing or
> the searching level. However, I've had little luck sifting through the
> information I've gotten (mostly points me to TFIDF) to find out if Lucene
> has something I can use to make this process faster.
> I have also read a bit about TermVectors, but those seem by-document as well.
> If there isn't a method at the search level (or,
> after-index-complete-level), I would be willing to accept the overhead of
> generating these stats at indexing time, if that would be more efficient...
> Thanks,
> drago

Ted Dunning, CTO
