lucene-java-user mailing list archives

From "Dmitry Goldenberg" <dmitry.goldenb...@weblayers.com>
Subject RE: How to get mapping of query terms to number of their occurrences in a doc?
Date Wed, 08 Feb 2006 17:18:53 GMT
Chris,
 
That's what I did, for debugging.  The query is "biology", and here's what the API tells me
for term frequencies:
biolog 15
biologi 31
biologist 4

I actually see 13 occurrences of "biologist" and "biologists", 64 occurrences of "biology",
27 occurrences of "biological".

I see "inform 22" but the actual count of the word "information" in the document is 33.  But
"ioniz 7" is correct.

I don't see much correlation between what the API is telling me and what I see in the actual
document.  Am I missing something?

Thanks

________________________________

From: Chris Hostetter [mailto:hossman_lucene@fucit.org]
Sent: Tue 2/7/2006 4:10 PM
To: java-user@lucene.apache.org
Subject: Re: How to get mapping of query terms to number of their occurrences in a doc?




A cursory reading of your code looks ok ... stemming shouldn't be an issue
as long as your measure of success is comparing docs that match your
original query with the counts you get out.

What I mean by that is that any stemming should have already been taken
care of when your query object was constructed (either by you manually, or
by QueryParser).  The direct equals comparisons you are doing should be
fine.
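
To illustrate the point about stemming, here is a minimal sketch in plain
Java with no Lucene dependency.  The lookup table is a toy stand-in for an
analyzer's stemmer, and the word-to-stem mapping in it is an assumption
chosen to be consistent with the stems and counts reported at the top of
this message.  The names StemMatchSketch, toyStem and frequencyOf are
hypothetical.  It shows that an exact comparison only matches when the
query text has been through the same stemming as the indexed terms, and
that a match on one stem does not include the counts of related stems.

```java
import java.util.HashMap;
import java.util.Map;

class StemMatchSketch {

    // Toy stand-in for an analyzer's stemmer: a fixed lookup table, NOT a
    // real stemming algorithm. The mapping is an assumption for illustration.
    static final Map<String, String> TOY_STEMS = new HashMap<>();
    static {
        TOY_STEMS.put("biology", "biologi");
        TOY_STEMS.put("biological", "biolog");
        TOY_STEMS.put("biologist", "biologist");
        TOY_STEMS.put("biologists", "biologist");
    }

    // Lower-cases the word and returns its toy stem, or the lower-cased
    // word itself when the table has no entry for it.
    static String toyStem(String word) {
        String lower = word.toLowerCase();
        String stem = TOY_STEMS.get(lower);
        return stem != null ? stem : lower;
    }

    // Exact-match lookup over parallel arrays shaped like the results of
    // getTerms() and getTermFrequencies() in the quoted code.
    static int frequencyOf(String indexedTerm, String[] terms, int[] freqs) {
        for (int j = 0; j < terms.length; j++) {
            if (terms[j].equals(indexedTerm)) {
                return freqs[j];
            }
        }
        return 0; // term not present in this field's vector
    }

    public static void main(String[] args) {
        String[] terms = {"biolog", "biologi", "biologist"};
        int[] freqs = {15, 31, 4};

        // Stemmed query text matches exactly one indexed term.
        String stemmed = toyStem("biology");
        System.out.println(stemmed + " " + frequencyOf(stemmed, terms, freqs)); // biologi 31

        // Unstemmed query text matches nothing in a stemmed index.
        System.out.println("biology " + frequencyOf("biology", terms, freqs)); // biology 0
    }
}
```

Note that the count for "biologi" says nothing about "biolog" or
"biologist"; each stem is a separate term with its own frequency.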

Have you tried adding logging of the raw term field/text and the freq
counts you get back to see if that helps you spot the problem?
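
One possible shape for that logging, as a sketch only: the helper below
takes a field name plus parallel term and frequency arrays (the same shape
the quoted code reads from each vector) and produces one line per indexed
term, flagging the ones that equal a query term.  The class name
TermFreqLogSketch, the method describe, the "field:text" key format and
the field name "contents" are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class TermFreqLogSketch {

    // Builds one log line per indexed term in the form "field:text freq".
    // A line is flagged when its field:text key equals one of the query
    // terms, which are passed in using the same "field:text" form.
    static List<String> describe(String field, String[] terms, int[] freqs,
                                 Set<String> queryTerms) {
        List<String> lines = new ArrayList<>();
        for (int j = 0; j < terms.length; j++) {
            String key = field + ":" + terms[j];
            String flag = queryTerms.contains(key) ? "  <-- matches query term" : "";
            lines.add(key + " " + freqs[j] + flag);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical field name; the terms and counts are the ones
        // reported earlier in this thread.
        String[] terms = {"biolog", "biologi", "biologist"};
        int[] freqs = {15, 31, 4};
        Set<String> queryTerms = new HashSet<>(Arrays.asList("contents:biologi"));

        for (String line : describe("contents", terms, freqs, queryTerms)) {
            System.out.println(line);
        }
    }
}
```

Dumping every term, not only the matching ones, makes it easier to see
which stems actually ended up in the index for the document.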


: Date: Mon, 6 Feb 2006 14:34:05 -0800
: From: Dmitry Goldenberg <dmitry.goldenberg@weblayers.com>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: How to get mapping of query terms to number of their occurrences
:     in a doc?
:
: Given a query, I want to be able to, for each query term, get the number of occurrences
: of the term.  I have tried what I'm including below and it does not seem to provide reliable
: results.  Seems to work fine with exact matching but as soon as stemming kicks in, all bets
: are off as to value of the number of occurrences returned.
:
: Any ideas, anyone?  Can this be written in a simpler and/or more efficient way?
: Thanks -
:
:       int totalOccurrences = 0;
:
:       reader = IndexReader.open(getDirectory(indexDirPath));
:       HashSet terms = new HashSet();
:       query.extractTerms(terms);
:
:       TermFreqVector[] tfvs = reader.getTermFreqVectors(docId);
:       if (tfvs != null) {
:
:         // For each term frequency vector (i.e. for each field)
:         for (int i = 0; i < tfvs.length; i++) {
:           String field = tfvs[i].getField();
:           String[] strTerms = tfvs[i].getTerms();
:           int[] tfs = tfvs[i].getTermFrequencies();
:
:           if (strTerms != null) {
:
:             // For each term in the query
:             for (Iterator iter = terms.iterator(); iter.hasNext();) {
:
:               Term term = (Term) iter.next();
:               // For each term in the vector
:               for (int j = 0; j < strTerms.length; j++) {
:
:                 // If found the query term among the vector terms
:                 if (field.equals(term.field()) && strTerms[j].equals(term.text())) {
:
:                   // Add the term frequency to the total
:                   totalOccurrences += tfs[j];
:
:                 }
:               }
:             }
:           }
:         }
:       }
:
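
On the question of a simpler or more efficient form: one option, sketched
here in plain Java with no Lucene types, is to index each field's terms in
a hash map once and then look up each query term directly, which avoids
the inner loop over every term in the vector.  The names
OccurrenceSumSketch, toFrequencyMap and totalOccurrences are hypothetical,
and the query texts are assumed to be already analyzed (stemmed) and to
belong to the same field as the vector.

```java
import java.util.HashMap;
import java.util.Map;

class OccurrenceSumSketch {

    // Indexes one field's parallel term/frequency arrays by term text.
    static Map<String, Integer> toFrequencyMap(String[] terms, int[] freqs) {
        Map<String, Integer> freqByTerm = new HashMap<>();
        for (int j = 0; j < terms.length; j++) {
            freqByTerm.put(terms[j], freqs[j]);
        }
        return freqByTerm;
    }

    // Sums the frequencies of the given query term texts; texts that are
    // absent from the field contribute nothing.
    static int totalOccurrences(Map<String, Integer> freqByTerm, String[] queryTexts) {
        int total = 0;
        for (String text : queryTexts) {
            Integer freq = freqByTerm.get(text);
            if (freq != null) {
                total += freq;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Terms and counts as reported in this thread.
        String[] terms = {"biolog", "biologi", "biologist", "inform", "ioniz"};
        int[] freqs = {15, 31, 4, 22, 7};

        Map<String, Integer> freqByTerm = toFrequencyMap(terms, freqs);
        String[] queryTexts = {"biologi", "ioniz", "missing"};
        System.out.println(totalOccurrences(freqByTerm, queryTexts)); // 38
    }
}
```

This changes only how the matching is done, not what is counted: the
result is still a sum of per-stem frequencies as stored in the vectors.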



-Hoss






