Subject: RE: How to get mapping of query terms to number of their occurrences in a doc?
Date: Wed, 8 Feb 2006 11:17:11 -0800
From: "Dmitry Goldenberg"

Duh! Bingo! Mystery solved. I should have thought of this :) The discrepancies come in with larger documents, definitely > 10K terms, which is Lucene's default maxFieldLength.

Thanks for your help, Chris

- Dmitry

________________________________

From: Chris Hostetter [mailto:hossman_lucene@fucit.org]
Sent: Wed 2/8/2006 10:04 AM
To: java-user@lucene.apache.org
Subject: RE: How to get mapping of query terms to number of their occurrences in a doc?

: That's what I did, for debugging. The query is "biology", and here's
: what the API tells me for term frequencies:
: biolog 15
: biologi 31
: biologist 4
:
: I actually see 13 occurrences of "biologist" and "biologists", 64
: occurrences of "biology", 27 occurrences of "biological".
:
: I see "inform 22" but the actual count of the word "information" in the
: document is 33. But "ioniz 7" is correct.

I think I misunderstood what you meant when you said the counts don't
match up. Are you comparing the number you get from that code with the
number of times you personally see the word in the source document before
it has been analyzed?

If so, then there could be a couple of things going on ... I would start
by using a tool like Luke to see the actual lists of Terms for each doc --
there may be something else your analyzer is doing that you don't realize.

It's also possible that you are hitting the maxFieldLength in the
IndexWriter ... when that happens IndexWriter throws away any remaining
tokens, so if your documents are really large, the terms past that limit
never make it into the index.

Lastly, I would add a *lot* more debugging to your code.
Print out the contents of "terms"; when you loop over "tfvs", print out
the field and the full list of strTerms; in the innermost loop, when you
increment the count, print out the field, text, and count. That's the best
advice I have for spotting what's wrong.

: ________________________________
:
: From: Chris Hostetter [mailto:hossman_lucene@fucit.org]
: Sent: Tue 2/7/2006 4:10 PM
: To: java-user@lucene.apache.org
: Subject: Re: How to get mapping of query terms to number of their occurrences in a doc?
:
: A cursory reading of your code looks ok ... stemming shouldn't be an issue
: as long as your measure of success is comparing docs that match your
: original query with the counts you get out.
:
: What I mean by that is that any stemming should have already been taken
: care of when your query object was constructed (either by you manually, or
: by QueryParser). The direct equals comparisons you are doing should be
: fine.
:
: Have you tried adding logging of the raw term field/text and the freq
: counts you get back to see if that helps you spot the problem?
:
: : Date: Mon, 6 Feb 2006 14:34:05 -0800
: : From: Dmitry Goldenberg
: : Reply-To: java-user@lucene.apache.org
: : To: java-user@lucene.apache.org
: : Subject: How to get mapping of query terms to number of their occurrences in a doc?
: :
: : Given a query, I want to be able to, for each query term, get the
: : number of occurrences of the term. I have tried what I'm including
: : below and it does not seem to provide reliable results. Seems to work
: : fine with exact matching but as soon as stemming kicks in, all bets are
: : off as to the value of the number of occurrences returned.
: :
: : Any ideas, anyone? Can this be written in a simpler and/or more
: : efficient way?
: : Thanks -
: :
: : int totalOccurrences = 0;
: :
: : reader = IndexReader.open(getDirectory(indexDirPath));
: : HashSet terms = new HashSet();
: : query.extractTerms(terms);
: :
: : TermFreqVector[] tfvs = reader.getTermFreqVectors(docId);
: : if (tfvs != null) {
: :
: :     // For each term frequency vector (i.e. for each field)
: :     for (int i = 0; i < tfvs.length; i++) {
: :         String field = tfvs[i].getField();
: :         String[] strTerms = tfvs[i].getTerms();
: :         int[] tfs = tfvs[i].getTermFrequencies();
: :
: :         if (strTerms != null) {
: :
: :             // For each term in the query
: :             for (Iterator iter = terms.iterator(); iter.hasNext();) {
: :
: :                 Term term = (Term) iter.next();
: :                 // For each term in the vector
: :                 for (int j = 0; j < strTerms.length; j++) {
: :
: :                     // If found the query term among the vector terms
: :                     if (field.equals(term.field()) && strTerms[j].equals(term.text())) {
: :
: :                         // Add the term frequency to the total
: :                         totalOccurrences += tfs[j];
: :
: :                     }
: :                 }
: :             }
: :         }
: :     }
: : }
: :
:
: -Hoss
:
: ---------------------------------------------------------------------
: To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
: For additional commands, e-mail: java-user-help@lucene.apache.org

-Hoss
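
For anyone finding this thread later, the effect described above (tokens past the field-length limit being dropped, so term frequencies come out lower than a count over the raw text) can be sketched in plain Java. This is only an illustration under stated assumptions: it does not call Lucene or reproduce its code, the class `TruncationDemo` and the method `countTerms` are hypothetical names, and the 12,000-token document is made up. The 10,000 cutoff is the default limit mentioned in the thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration only; hypothetical names, no Lucene classes involved. */
final class TruncationDemo {

    /** Counts occurrences of each token, looking only at the first maxTokens tokens. */
    static Map<String, Integer> countTerms(List<String> tokens, int maxTokens) {
        Map<String, Integer> counts = new HashMap<>();
        int limit = Math.min(tokens.size(), maxTokens);
        for (int i = 0; i < limit; i++) {
            // Add 1 to the running count for this token (starting from 1 if absent)
            counts.merge(tokens.get(i), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up document: 12,000 tokens, with "biolog" at every 100th position
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < 12_000; i++) {
            tokens.add(i % 100 == 0 ? "biolog" : "filler");
        }

        // Count over the whole document versus only the first 10,000 tokens
        int full = countTerms(tokens, Integer.MAX_VALUE).get("biolog");
        int truncated = countTerms(tokens, 10_000).get("biolog");

        System.out.println("whole document: " + full);
        System.out.println("first 10,000 tokens: " + truncated);
    }
}
```

With this made-up input the two counts differ (120 versus 100), which is the same kind of gap as the "inform 22" versus 33 discrepancy reported in the thread. Short documents that stay under the limit would show no difference, which fits the observation that only the larger documents were off.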