It seems like there should be a formula for estimating the total
number of unique terms, given the unique term counts for each
segment and certain assumptions, such as random document
distribution across segments.
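
A rough sketch of what such an estimate could look like, in plain Python rather than against any Lucene API. Two things are certain regardless of distribution: the total is at least the largest per-segment count and at most the sum of all per-segment counts. For a point estimate, the sketch below swaps in Heaps' law (V(n) = K * n^beta) as the model, which is an assumption on my part and not something built into Lucene. It also assumes you know each segment's size (token count, or document count if documents are of similar length). The function names are hypothetical.

```python
import math


def unique_term_bounds(segment_term_counts):
    """Hard bounds on the total unique-term count across all segments.

    Lower bound: the largest segment's count (every one of its terms is in the total).
    Upper bound: the sum of all counts (no term shared between segments).
    """
    return max(segment_term_counts), sum(segment_term_counts)


def estimate_total_unique_terms(segment_sizes, segment_term_counts):
    """Estimate the total unique-term count by fitting Heaps' law.

    ASSUMPTION: vocabulary grows as V(n) = K * n**beta, and documents are
    spread randomly across segments, so every segment follows the same curve.
    segment_sizes are token (or document) counts per segment.
    """
    if len(segment_sizes) != len(segment_term_counts):
        raise ValueError("need one size and one term count per segment")

    # Least-squares line fit in log-log space: log V = log K + beta * log n.
    xs = [math.log(n) for n in segment_sizes]
    ys = [math.log(v) for v in segment_term_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("segments must differ in size to fit the exponent")
    beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    log_k = mean_y - beta * mean_x

    # Extrapolate the fitted curve to the combined size of all segments.
    estimate = math.exp(log_k + beta * math.log(sum(segment_sizes)))

    # Never report a value outside the hard bounds.
    lower, upper = unique_term_bounds(segment_term_counts)
    return min(max(estimate, lower), upper)
```

For example, three segments of sizes 100, 400 and 1600 with 100, 200 and 400 unique terms give bounds of 400 and 700 and an estimate of about 458. With real indexes the fit will be noisy, especially with few segments or segments of nearly equal size, so treat the result as a ballpark figure only.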
Yonik
http://www.lucidimagination.com
On Thu, May 27, 2010 at 9:17 PM, kannan chandrasekaran
<ckannanck@yahoo.com> wrote:
> I am just trying out a few experiments to calculate similarity between terms based on
> their co-occurrences in the dataset. Basically, I am trying to build contextual vectors
> and calculate similarity using a similarity measure (say, cosine similarity).
>
> I don't think this is an XY problem. The vectors I am trying to build are not the same
> as the TermVectors option ((term, freq) pairs per document) in Lucene (if that's what you
> meant).
>
> Thanks
> Kannan

