Message-ID: <20050714111721.4596.qmail@web26005.mail.ukl.yahoo.com>
Date: Thu, 14 Jul 2005 12:17:21 +0100 (BST)
From: mark harwood
Subject: Re: SIPs and CAPs
To: java-user@lucene.apache.org
In-Reply-To: <1272C20F-FC23-4888-8FE8-5A96D5598A1F@ehatchersolutions.com>

I've done this by comparing term frequency in a subset (in Amazon's case, a single book) with that of the general corpus, looking for a significant "uplift" in term popularity.

Practically speaking, in the Amazon case you can treat each page in the example book as a Lucene document, index the pages into a RAMDirectory, and then walk a TermEnum over that index to get the docFreqs for all words and compare them with the corpus docFreqs.

The "uplift" score for each term is:

    (subsetDocFreq/subsetNumDocs) - (corpusDocFreq/corpusNumDocs)

Take the top "n" terms scored by the above, then analyze the text of the subset looking for runs of these terms.

I have some code for this that I have wanted to package up as a contribution for some time.
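For illustration only, here is a minimal sketch of the uplift scoring and run detection described above. It is not the code mentioned in this message and it does not use Lucene: plain Java collections stand in for the docFreq counts you would read from the index, and the class name, method names, the crude tokenizer, and the minimum run length of two terms are all assumptions made for the example.

```java
import java.util.*;

// Hypothetical stand-in for the approach described above; not Lucene code.
public class UpliftSketch {

    // Document frequency: number of documents containing each term at least once.
    static Map<String, Integer> docFreqs(List<String> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (String doc : docs) {
            // Assumed tokenizer: lowercase, split on anything that is not a-z.
            Set<String> seen = new HashSet<>(Arrays.asList(doc.toLowerCase().split("[^a-z]+")));
            seen.remove("");
            for (String term : seen) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // uplift = (subsetDocFreq/subsetNumDocs) - (corpusDocFreq/corpusNumDocs)
    static Map<String, Double> uplift(Map<String, Integer> subsetDf, int subsetNumDocs,
                                      Map<String, Integer> corpusDf, int corpusNumDocs) {
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : subsetDf.entrySet()) {
            double subsetRate = e.getValue() / (double) subsetNumDocs;
            double corpusRate = corpusDf.getOrDefault(e.getKey(), 0) / (double) corpusNumDocs;
            scores.put(e.getKey(), subsetRate - corpusRate);
        }
        return scores;
    }

    // The n highest-scoring terms; ties are broken alphabetically.
    static List<String> topTerms(Map<String, Double> scores, int n) {
        List<String> terms = new ArrayList<>(scores.keySet());
        terms.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }

    // Maximal runs of two or more consecutive top terms (the minimum length is an assumption).
    static List<String> runs(String text, Set<String> top) {
        List<String> found = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (top.contains(token)) {
                current.add(token);
                continue;
            }
            if (current.size() > 1) {
                found.add(String.join(" ", current));
            }
            current.clear();
        }
        if (current.size() > 1) {
            found.add(String.join(" ", current));
        }
        return found;
    }
}
```

In a real index the two docFreq maps would come from the subset index and the main corpus index respectively, and the subset text would go through the same analyzer as the corpus so that the terms line up.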