Subject: Re: Document Frequency for a set of documents
From: Ard Schrijvers
To: general@lucene.apache.org, java-user@lucene.apache.org
Date: Fri, 5 Feb 2010 10:53:46 +0100

Crossposting to the user list, as I think this issue belongs there. See my comments inline.

On Fri, Feb 5, 2010 at 10:27 AM, lionel duboeuf wrote:
> Hi,
>
> Sorry for asking again, **I still have not found a scalable solution to get
> the document frequency of a term t according a set of documents. Lucene only
> store the document frequency for the global corpus, but i would like to be
> able to get the document frequency of a term according only to a subset of
> documents (i.e. a user's collection of documents).
>
> I guess that querying the index to get the number of hits for each term and
> for each field, filtered by a user will be to slow.
> Any idea ?

I recently developed out-of-the-box faceted navigation exposed over JCR (Hippo Repository on top of Jackrabbit), and I think you are looking for efficient faceted navigation as well, right? I would also be interested to hear whether others have something to add to my findings.

You can approach your issue from two angles. Depending on the number of results versus the number of terms (unique facet values), I think it is best to switch at runtime between the two approaches.

Approach (1): the Lucene TermEnum is leading. If the Lucene field has *many* unique values (say more than 100,000), this becomes slow and approach (2) might be better.

You have a BitSet matchingDocs, and you want the count for every term of the field 'brand', counting only documents that are in matchingDocs. With 'brand' as the field, you can do:

    // Iterate through all the values of this facet and look at the number of hits per term.
    // internalFacetName is the interned field name (here "brand"); facetValueCountMap and
    // Count are defined elsewhere.
    TermEnum termEnum = indexReader.terms(new Term("brand", ""));
    try {
        // Open termDocs only once and use seek: this is more efficient.
        TermDocs termDocs = indexReader.termDocs();
        try {
            do {
                Term term = termEnum.term();
                int count = 0;
                // Interned comparison: stop as soon as the enum leaves the facet field.
                if (term != null && term.field() == internalFacetName) {
                    termDocs.seek(term);
                    while (termDocs.next()) {
                        if (matchingDocs.get(termDocs.doc())) {
                            count++;
                        }
                    }
                    // Skip terms without matching documents and the empty term.
                    if (count > 0 && !"".equals(term.text())) {
                        facetValueCountMap.put(term.text(), new Count(count));
                    }
                } else {
                    break;
                }
            } while (termEnum.next());
        } finally {
            termDocs.close();
        }
    } finally {
        termEnum.close();
    }

Approach (2): the matching docs are leading. All Lucene fields that should be usable for your facet counts must be indexed with TermVectors. This approach becomes slow when the matching docs grow beyond 100,000 hits.
In that case, use approach (1) instead.

Create your own HitCollector and give its collect method a body something like:

    public final void collect(final int docid, final float score) {
        try {
            if (facetMap != null) {
                // Read the stored term vector of the facet field for this hit.
                final TermFreqVector tfv = reader.getTermFreqVector(docid, internalName);
                if (tfv != null) {
                    // Each term in the vector occurs in this document, so it counts once.
                    for (final String term : tfv.getTerms()) {
                        addToFacetMap(term);
                    }
                }
            }
        } catch (IOException e) {
            // Error handling was not part of the original snippet.
        }
    }

Note that HitCollectors are not advised for large hit sets; see also [1].

This is how I currently have a really performant faceted navigation exposed as a JCR tree. If somebody has tried other ways, or has something to add, I would be interested.

Regards
Ard

[1] http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/search/HitCollector.html

>
> regards,
>
> Lionel
>
> *
> *
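
To make the difference between the two approaches concrete without a Lucene dependency, here is a minimal, self-contained Java sketch. It is a toy model and not the code from my implementation: the posting lists are a plain Map from term to doc ids (standing in for TermEnum/TermDocs), the per-document term arrays stand in for term vectors, and the class name, the helper names, the sample data and the switching rule in count() are all assumptions made up for illustration.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of the two facet-counting strategies; no Lucene involved. */
final class FacetCountSketch {

    /** Approach (1): terms are leading. Walk each posting list and test docs against the match set. */
    static Map<String, Integer> countByTerms(Map<String, int[]> postings, BitSet matchingDocs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            int count = 0;
            for (int doc : entry.getValue()) {
                if (matchingDocs.get(doc)) {
                    count++;
                }
            }
            // Keep only non-empty terms that occur in at least one matching document.
            if (count > 0 && !entry.getKey().isEmpty()) {
                counts.put(entry.getKey(), count);
            }
        }
        return counts;
    }

    /** Approach (2): matching docs are leading. Read each hit's distinct terms (term-vector stand-in). */
    static Map<String, Integer> countByDocs(String[][] termsPerDoc, BitSet matchingDocs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int doc = matchingDocs.nextSetBit(0); doc >= 0; doc = matchingDocs.nextSetBit(doc + 1)) {
            if (doc >= termsPerDoc.length || termsPerDoc[doc] == null) {
                continue; // no term vector for this document
            }
            for (String term : termsPerDoc[doc]) {
                if (!term.isEmpty()) {
                    counts.merge(term, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Hypothetical runtime switch: fewer hits than unique terms favours the doc-led approach. */
    static Map<String, Integer> count(Map<String, int[]> postings, String[][] termsPerDoc,
                                      BitSet matchingDocs) {
        return matchingDocs.cardinality() < postings.size()
                ? countByDocs(termsPerDoc, matchingDocs)
                : countByTerms(postings, matchingDocs);
    }

    /** Builds a BitSet with the given doc ids set. */
    static BitSet bits(int... docs) {
        BitSet set = new BitSet();
        for (int doc : docs) {
            set.set(doc);
        }
        return set;
    }

    /** Made-up posting lists for a 'brand' field over five documents (ids 0..4). */
    static Map<String, int[]> samplePostings() {
        Map<String, int[]> postings = new HashMap<>();
        postings.put("acme", new int[] {0, 2, 3});
        postings.put("globex", new int[] {1, 3});
        postings.put("initech", new int[] {4});
        return postings;
    }

    /** The same made-up data, organised per document as a term vector would be. */
    static String[][] sampleTermsPerDoc() {
        return new String[][] {
            {"acme"}, {"globex"}, {"acme"}, {"acme", "globex"}, {"initech"}
        };
    }

    public static void main(String[] args) {
        BitSet userDocs = bits(0, 1, 3); // e.g. one user's collection of documents
        System.out.println(count(samplePostings(), sampleTermsPerDoc(), userDocs));
    }
}
```

With the sample data and the subset {0, 1, 3}, both strategies give {acme=2, globex=2}: 'initech' only occurs in document 4, which is outside the subset. The term-led variant does work proportional to the total length of all posting lists, while the doc-led variant does work proportional to the number of hits times the terms per hit, which is why the better choice depends on the number of results versus the number of unique terms.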