Date: Thu, 14 Jul 2005 14:27:46 +0100 (BST)
From: mark harwood
Subject: Re: SIPs and CAPs
To:
java-user@lucene.apache.org

> Do you just do this with terms or do you also
> extract phrases?

The scheme involves these phases:

1) Identify top terms (using the algorithm described earlier)
2) Identify all term "runs" in the original text
3) Identify sensible phrases from the large list of term runs
4) Provide a shortlist of top-scoring terms AND phrases

Step 1 is done as described in my earlier post.

Step 2 I currently do by re-running an Analyzer on the original text. It is possible that this could be done using the RAMDirectory from Step 1 and SpanQueries or some such, but I have found it is important to go back to the original text to get sensible terms/phrases. If your indexed content used stemming and stop word removal and you *didn't* look at the original text, you would identify phrases like "united state america" instead of "United States of America".

Step 3 is needed to consolidate all of the learning about term usage. For example, the code may choose to collapse the run "United States Of America invades" into the shorter "United States" run because the longer run occurs much less often and all of the shorter run's terms are in the longer one.

Step 4 ranks the phrases and terms to produce a shortlist consisting of both. Some terms are always used in phrases (so will not be selected as single terms). Some terms *never* appear in a phrase, so they are considered for shortlisting on their own.

There are probably a number of ways in which these different phases can be implemented, but I've found them all to be necessary if you want to present the findings in a readable form to end users.
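Steps 2 and 3 can be sketched roughly as follows. This is only an illustration in plain Java, not the code from the post: the class `PhraseRuns`, its methods, and the `ratio` threshold are hypothetical names, lower-casing stands in for a real Analyzer (no stemming is done), and the top terms from Step 1 are assumed to be supplied as a ready-made set. A run here is a maximal sequence of top terms within one punctuation-delimited segment, optionally bridged by stop words, and it is recorded in its original surface form.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: these names are illustrative and are not Lucene APIs.
final class PhraseRuns {

    // Step 2 (sketch): count maximal runs of top terms, keeping the original surface text.
    static Map<String, Integer> findRuns(String text, Set<String> topTerms, Set<String> stopWords) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String segment : text.split("[.,;:!?()\\r\\n]+")) { // runs never cross punctuation
            List<String> run = new ArrayList<>();
            int lastTop = 0; // length of the run up to and including its last top term
            for (String token : segment.trim().split("\\s+")) {
                String key = token.toLowerCase(Locale.ROOT); // assumption: stand-in for real analysis
                if (topTerms.contains(key)) {
                    run.add(token);
                    lastTop = run.size();
                } else if (stopWords.contains(key) && !run.isEmpty()) {
                    run.add(token); // tentative bridge, e.g. "of" in "United States of America"
                } else {
                    record(run, lastTop, counts);
                    run.clear();
                    lastTop = 0;
                }
            }
            record(run, lastTop, counts);
        }
        return counts;
    }

    // Records the run, minus trailing stop words, if it holds at least two top terms.
    private static void record(List<String> run, int lastTop, Map<String, Integer> counts) {
        if (lastTop >= 2) {
            counts.merge(String.join(" ", run.subList(0, lastTop)), 1, Integer::sum);
        }
    }

    // Step 3 (sketch): fold a long run into the most frequent shorter run it contains,
    // provided the shorter run occurs at least `ratio` times as often.
    static Map<String, Integer> collapse(Map<String, Integer> runs, int ratio) {
        if (ratio < 2) {
            throw new IllegalArgumentException("ratio must be at least 2");
        }
        Map<String, Integer> out = new LinkedHashMap<>(runs);
        for (Map.Entry<String, Integer> longer : runs.entrySet()) {
            String best = null;
            for (Map.Entry<String, Integer> shorter : runs.entrySet()) {
                boolean contained = !shorter.getKey().equals(longer.getKey())
                        && (" " + longer.getKey() + " ").contains(" " + shorter.getKey() + " ");
                boolean muchRarer = (long) longer.getValue() * ratio <= shorter.getValue();
                if (contained && muchRarer
                        && (best == null || shorter.getValue() > runs.get(best))) {
                    best = shorter.getKey();
                }
            }
            if (best != null) {
                out.merge(best, out.remove(longer.getKey()), Integer::sum);
            }
        }
        return out;
    }
}
```

With top terms {united, states, america, invades} and stop words {of, the}, a text that contains "United States of America invades" once and a bare "United States" three times yields two runs, and `collapse(runs, 3)` folds the longer one into "United States". Comparing surface forms means runs that differ only in case are counted separately; a fuller version would probably normalise the key and keep the most common surface form for display.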