From: Robert Muir
Date: Sun, 6 May 2012 19:32:08 -0400
Subject: Re: Calculating IDF value more efficiently
To: java-user@lucene.apache.org

Look at IndexReader.docFreq.

On Fri, Apr 27, 2012 at 10:38 PM, Kasun Perera wrote:
> This is my program to calculate the TF-IDF value for a document in a
> collection of documents. It works fine, but it takes a lot of time when
> calculating the "IDF" values (finding the number of documents that contain
> a particular term).
>
> Is there a more efficient way of finding the number of documents that
> contain a particular term?
>
> freq = termsFreq.getTermFrequencies();
>
> terms = termsFreq.getTerms();
>
> int noOfTerms = terms.length;
>
> score = new float[noOfTerms];
> DefaultSimilarity simi = new DefaultSimilarity();
>
>        for (i = 0; i < noOfTerms; i++) {
>
>            int noofDocsContainTerm = noOfDocsContainTerm(terms[i]);
>
>            float tf = simi.tf(freq[i]);
>
>            float idf = simi.idf(noofDocsContainTerm, noOfDocs);
>
>            score[i] = tf * idf ;
>
>        }
>
> ////
>
> public int noOfDocsContainTerm(String querystr) throws
> CorruptIndexException, IOException, ParseException{
>
> QueryParser qp=new QueryParser(Version.LUCENE_35, "docuemnt", new
> StandardAnalyzer(Version.LUCENE_35));
>
> Query q=qp.parse(querystr);
>
> int hitsPerPage = docNames.length; //minimum number of search results
> IndexSearcher searcher = new IndexSearcher(ramMemDir, true);
> TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
>
> searcher.search(q, collector);
>
> ScoreDoc[] hits = collector.topDocs().scoreDocs;
>
>    return hits.length;
> }
>
>
> --
> Regards
>
> Kasun Perera

-- 
lucidimagination.com
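
To illustrate the suggestion: the quoted code parses each term, opens a new IndexSearcher, and collects hits just to count them, whereas a document-frequency lookup reads a number that is already known per term. The plain-Java sketch below models that idea without depending on Lucene. The class TfIdfSketch and its method names are hypothetical, and the tf and idf formulas are written from my understanding of DefaultSimilarity in Lucene 3.x (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1))), so treat them as an assumption rather than the library's definitive behavior.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for an index: document frequencies are computed once
// and then looked up per term, instead of running one search per term.
final class TfIdfSketch {

    // For every term, count how many documents contain it (single pass).
    static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            Set<String> unique = new HashSet<>(doc); // count each doc once per term
            for (String term : unique) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Assumed to mirror DefaultSimilarity.tf in Lucene 3.x.
    static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    // Assumed to mirror DefaultSimilarity.idf in Lucene 3.x.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // TF-IDF score for each distinct term of one document.
    static Map<String, Float> score(List<String> doc,
                                    Map<String, Integer> df,
                                    int numDocs) {
        Map<String, Integer> freq = new HashMap<>();
        for (String term : doc) {
            freq.merge(term, 1, Integer::sum);
        }
        Map<String, Float> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            int docFreq = df.getOrDefault(e.getKey(), 0); // O(1) lookup, no search
            scores.put(e.getKey(), tf(e.getValue()) * idf(docFreq, numDocs));
        }
        return scores;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("lucene", "index", "search"),
                Arrays.asList("lucene", "term", "term"),
                Arrays.asList("query", "parser"));
        Map<String, Integer> df = documentFrequencies(docs);
        System.out.println(score(docs.get(1), df, docs.size()));
    }
}
```

In the quoted program, the corresponding change would presumably be to open one IndexReader up front and replace the body of noOfDocsContainTerm with a call along the lines of reader.docFreq(new Term("docuemnt", terms[i])), using reader.numDocs() for the collection size. The terms returned by the term vector are already analyzed, so no QueryParser step should be needed for the lookup.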