Subject: Re: How to get the number of unique terms in the inverted index
From: Yonik Seeley
Reply-To: yonik@lucidimagination.com
To: java-user@lucene.apache.org
Date: Fri, 28 May 2010 08:19:41 -0400
In-Reply-To: <802151.58187.qm@web55203.mail.re4.yahoo.com>

It seems like there should be a formula for estimating the total number of unique terms, given the unique term counts for each segment and certain assumptions, such as random document distribution across segments.

-Yonik
http://www.lucidimagination.com

On Thu, May 27, 2010 at 9:17 PM, kannan chandrasekaran wrote:
> I am just trying out a few experiments to calculate similarity between terms based on their co-occurrences in the dataset... Basically I am trying to build contextual vectors and calculate similarity using a similarity measure (say, cosine similarity).
>
> I don't think this is an XY problem.
> The vectors I am trying to build are not the same as the TermVectors option ((term, freq) pairs per document) in Lucene (if that's what you meant).
>
> Thanks
> Kannan
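
As a rough illustration of the kind of estimate suggested above, here is a minimal sketch. It is not a formula given in this thread: it assumes vocabulary growth follows Heaps' law, V(n) = K * n^beta, which is a swapped-in model, and that documents are randomly distributed across segments so that every segment is a sample from the same curve. The function name and the (doc_count, unique_terms) input format are hypothetical; the per-segment numbers would have to be collected from the index separately. The result is clamped between the two bounds that always hold: the largest per-segment count and the sum of all per-segment counts.

```python
import math


def estimate_total_unique_terms(segments):
    """Estimate the unique-term count of the whole index.

    segments: list of (doc_count, unique_terms) pairs, one per segment.
    Assumption (not from the thread): Heaps' law, V(n) = K * n**beta,
    fitted by least squares in log-log space across the segments.
    """
    if not segments:
        raise ValueError("need at least one segment")
    if any(n <= 0 or v <= 0 for n, v in segments):
        raise ValueError("doc counts and term counts must be positive")
    if len({n for n, _ in segments}) < 2:
        raise ValueError("need at least two segments of different sizes")

    # Bounds that hold regardless of the model: the merged index has at
    # least as many terms as its largest segment, and at most the sum.
    lower = max(v for _, v in segments)
    upper = sum(v for _, v in segments)

    # Ordinary least squares on (log n, log V) to get beta and log K.
    xs = [math.log(n) for n, _ in segments]
    ys = [math.log(v) for _, v in segments]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    log_k = mean_y - beta * mean_x

    # Extrapolate the fitted curve to the total document count.
    total_docs = sum(n for n, _ in segments)
    estimate = math.exp(log_k + beta * math.log(total_docs))

    # Clamp to the model-independent bounds.
    return min(max(estimate, lower), upper)
```

For example, calling estimate_total_unique_terms([(100, 100), (400, 200), (1600, 400)]) fits the curve to three segments and extrapolates to 2100 documents. If segments are not random samples (for instance, documents indexed in topical or chronological order), the fit can be badly off, and only the lower and upper bounds remain trustworthy.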