From: André Warnier
Date: Thu, 17 Dec 2009 12:04:45 +0100
To: general@lucene.apache.org
Subject: Re: Frequency Term of Composite words

Antonio Calò wrote:
> Hi Ted.
>
> Thank you very much for your feedback.
>
> I can see the term frequency for each term, but not for pairs or longer
> sequences of terms together.
>
> An example: "the quick brown fox jumps over the lazy dog. But the big dog
> was sleeping.So The lazy dog didn't see the fox"
>
> So, with your suggestion I'm able to find that tf("dog") = 3,
> tf("fox") = 2, ... (the terms consist of just one word).
>
> But it seems that TermFreqVector cannot answer this: tf("lazy
> dog")=2, tf("quick brown")=1.
>
> Unfortunately I've been asked to retrieve the occurrences of a set of
> concepts in a document, and I was trying to use Lucene because my simple
> mapping algorithm is too slow :(.
>
> I'll try to see if I can do something with TermFreqVector, or with the
> Analyzer. Or I'll go and look for another way :)
>
> Antonio
>
>
> 2009/12/16 Ted Dunning
>
>> You need the term frequency vector.
>>
>> See here
>>
>> http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/index/IndexReader.html#getTermFreqVector%28int,%20java.lang.String%29
>>
>> This is compatible in 3.0 as well:
>>
>> http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/index/IndexReader.html#getTermFreqVector%28int,%20java.lang.String%29
>>
>> Note the package change.
>>
>>
>> On Wed, Dec 16, 2009 at 7:34 AM, Antonio Calò
>> wrote:
>>
>>> Hi all,
>>>
>>> I hope that you can help me with this.
>>>
>>> I'm looking for a fast way to obtain, for a given word, its term
>>> frequency (I mean how many times it occurs in a single doc). I've been
>>> looking into the mail archive and the LIA (Lucene In Action) book and I
>>> found something like this:
>>>
>>> IndexSearcher index = new IndexSearcher(invertedIndexinRam);
>>> Term term = new Term("doc", "quick");
>>> int occurrence = index.docFreq(term);
>>>
>>> OK, occurrence contains the occurrences of the word "quick" in the index
>>> (in my case the index will contain only one document, for example "the
>>> quick brown fox jumps over the lazy dog"). In this case the occurrence
>>> will be 1. :)
>>>
>>> But now I need to retrieve the occurrences of a composite word, for
>>> example "quick brown fox", and I'm having trouble working out how to do
>>> this.
>>>

I haven't even really started to use Lucene yet, but I follow this list.
So just an unqualified idea:
- assuming each word is indexed, along with its position in each item
- assuming that you kept all the words, and did not strip out "stop words"
- assuming that you have the list of items which contain all of the words
  composing your multi-word term
- then you should be able to determine which items contain
    word 1 of your term in position n
    word 2 of your term in position n+1
  etc.
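
To make the position idea above concrete, here is a minimal sketch in plain Java. It deliberately does not use any Lucene API: the class PhraseCounter, its tokenizer, and its in-memory position map are all hypothetical names invented for illustration, and the tokenization (lowercase, split on anything that is not a letter or digit, no stop-word removal) is an assumption. It records the positions of every word in one document, then counts a multi-word term by checking that word 1 occurs at position n, word 2 at n+1, and so on.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Lucene.
final class PhraseCounter {
    // word -> ascending list of positions at which it occurs in the document
    private final Map<String, List<Integer>> positions = new HashMap<>();

    PhraseCounter(String text) {
        List<String> tokens = tokenize(text);
        for (int i = 0; i < tokens.size(); i++) {
            positions.computeIfAbsent(tokens.get(i), k -> new ArrayList<>()).add(i);
        }
    }

    // Assumed tokenization: lowercase, split on non-letter/non-digit runs,
    // keep every word (no stop-word removal).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Counts how often the words of the phrase occur at consecutive positions.
    int countPhrase(String phrase) {
        List<String> words = tokenize(phrase);
        if (words.isEmpty()) {
            return 0;
        }
        int count = 0;
        // Every position n of word 1 is a candidate start of the phrase.
        for (int start : positions.getOrDefault(words.get(0), Collections.emptyList())) {
            boolean match = true;
            for (int i = 1; i < words.size() && match; i++) {
                List<Integer> p = positions.getOrDefault(words.get(i), Collections.emptyList());
                // Word i+1 must occur exactly at position n+i.
                match = Collections.binarySearch(p, start + i) >= 0;
            }
            if (match) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        PhraseCounter doc = new PhraseCounter(
            "the quick brown fox jumps over the lazy dog. But the big dog "
            + "was sleeping.So The lazy dog didn't see the fox");
        System.out.println(doc.countPhrase("lazy dog"));        // prints 2
        System.out.println(doc.countPhrase("quick brown fox")); // prints 1
    }
}
```

A single-word term is just the one-word case of the same check, so countPhrase("dog") returns the plain term frequency. Whether the same positions can be read back from a Lucene index depends on how the field was indexed, which this sketch does not cover.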