It seems like there should be a formula for estimating the total
number of unique terms, given the unique term count for each segment
and certain assumptions, such as documents being randomly distributed
across segments.
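
I don't have that formula, but whatever it is, its result has to fall
within two bounds that hold without any assumptions: the total cannot be
smaller than the largest per-segment count (every segment's terms are in
the index) and cannot exceed the sum of the counts (the case where no
term is shared between segments). A minimal sketch in Python; the
function name and the sample counts are made up for illustration and
are not part of any Lucene API:

```python
def unique_term_bounds(segment_counts):
    """Return (lower, upper) bounds on the index-wide unique term count.

    segment_counts: unique term count of each segment (hypothetical input).
    lower: the largest segment count, reached when every other segment's
           terms are a subset of that segment's terms.
    upper: the sum of the counts, reached when segments share no terms.
    """
    if not segment_counts:
        return (0, 0)
    return (max(segment_counts), sum(segment_counts))


# Made-up per-segment counts, only to show the call.
lower, upper = unique_term_bounds([120, 80, 200])
print(lower, upper)
```

An estimate based on random document distribution would land somewhere
between these two values; how close to the lower bound depends on how
much vocabulary the segments share.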
> I am just trying out a few experiments to calculate similarity between terms based on
their co-occurrences in the dataset... Basically, I am trying to build contextual vectors
and calculate similarity using a similarity measure (say, cosine similarity).
> I don't think this is an XY problem. The vectors I am trying to build are not the same
as the TermVectors option ((term, freq) pairs per document) in Lucene (if that's what you
meant).
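
For reference, the kind of contextual vector described above might look
roughly like the following. This is only a sketch of one common
approach, not necessarily what the poster is doing: it assumes documents
are already tokenized into lists of terms, and it counts co-occurrences
within a fixed window (the window size and the function names are
assumptions of mine). It is plain Python and does not touch Lucene.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_vectors(docs, window=2):
    """Map each term to a Counter of the terms seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in docs:
        for i, term in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:  # skip the term's own position
                    vectors[term][tokens[j]] += 1
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy data: "c" and "d" appear in the same context (next to "b").
docs = [["a", "b", "c"], ["a", "b", "d"]]
vectors = cooccurrence_vectors(docs, window=1)
print(cosine_similarity(vectors["c"], vectors["d"]))
```

Terms that occur in similar contexts get vectors pointing in similar
directions, which is what makes this different from the per-document
(term, freq) pairs stored by TermVectors.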
