Date: Sun, 9 Aug 2009 19:45:41 -0800
Message-ID: <3b61738b0908092045m24584b3ep5426b18ad3b2f4f6@mail.gmail.com>
Subject: Terms-Across-All-Documents
From: "K. M. McCormick"
To: general@lucene.apache.org

Hello There:

I am currently working on an index stat generator that I'd like to use for some term-weight tests on a (rather large) Lucene index. In general, the stats I'm hoping to work with are based on a term's frequency across the entire indexed document set.

TF-IDF works easily in Lucene's searcher, and you can get a Term's DF (across all documents, obviously) quite easily. However, TF in Lucene seems limited to a by-document basis. Meaning, to get the number of times a term has appeared in the whole indexed document set, I would have to (hypothetically) do the following:

- Given Term t, find TF(t)
- Get the enumeration of t over the index: TermDocs (so I have (doc, freq) pairings)
- For each (doc, freq) pair, add freq to the total-index-frequency

So if I have x terms, I would be repeating that walk x times, once per term, across the entire index to find the index-frequency for all terms.

Is this the only method of getting this information? Since my data set (and term set) are quite large, I was trying to find out whether Lucene has another mechanism in place, either at the indexing or the searching level. However, I've had little luck sifting through the information I've found (it mostly points me to TF-IDF) to find out if Lucene has something I can use to make this process faster.
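The steps above can be sketched as follows. This is only a minimal, self-contained illustration of the accumulation loop, not Lucene code: the `postings` map, the `{docId, freq}` arrays, and the class and method names are hypothetical stand-ins. In a real index the pairs would presumably come from walking a TermDocs for each term rather than from an in-memory map.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class IndexTermTotals {

    /**
     * Sums freq over every (doc, freq) pair of every term.
     * Each int[] is a hypothetical {docId, freq} pair, standing in for
     * one step of a TermDocs walk (assumption: not the Lucene API itself).
     */
    static Map<String, Long> totalFrequencies(Map<String, List<int[]>> postings) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (Map.Entry<String, List<int[]>> entry : postings.entrySet()) {
            long total = 0;
            for (int[] docFreq : entry.getValue()) {
                total += docFreq[1]; // add this document's freq to the index-wide total
            }
            totals.put(entry.getKey(), total);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up sample data: term -> list of {docId, freq}
        Map<String, List<int[]>> postings = new LinkedHashMap<>();
        postings.put("lucene", Arrays.asList(new int[]{0, 3}, new int[]{2, 1}, new int[]{5, 4}));
        postings.put("index", Arrays.asList(new int[]{0, 1}, new int[]{1, 2}));

        System.out.println(totalFrequencies(postings)); // prints {lucene=8, index=3}
    }
}
```

Written this way, each (doc, freq) pair is visited exactly once, so the total work is proportional to the number of pairs in the index (the sum of DF over all terms), not to the number of terms multiplied by the number of documents.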
I have also read a bit about TermVectors, but those seem by-document as well. If there isn't a method at the search level (or, after-index-complete-level), I would be willing to accept the overhead of generating these stats at indexing time, if that would be more efficient...

Thanks,
drago