lucene-java-user mailing list archives

From <oh...@cox.net>
Subject Re: Weird discrepancy with term counts vs. terms (off by 1)
Date Sun, 02 Aug 2009 19:12:12 GMT
Hi Phil,

As for the problem with my app, it wasn't what you suggested (about the tokens, etc.).

For some things that come later, my indexer creates both a "path" field that is analyzed (and thus
tokenized, etc.) and another field, "fullpath", which is not analyzed (and thus not tokenized).
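
To illustrate why the two fields end up with different term counts, here is a minimal sketch in plain Java. It does not use Lucene at all: analyze() is a crude stand-in that lowercases and splits on anything that is not a letter or digit (the real StandardAnalyzer follows different rules), and the class name, method names and sample paths are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class FieldTermsSketch {

    // Crude stand-in for an analyzer (NOT StandardAnalyzer):
    // lowercase the text, then split on any run of non-alphanumeric characters.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Distinct terms of an analyzed field: every token of every value.
    static Set<String> analyzedTerms(List<String> values) {
        Set<String> terms = new TreeSet<>();
        for (String value : values) {
            terms.addAll(analyze(value));
        }
        return terms;
    }

    // Distinct terms of a non-analyzed field: each whole value is one term.
    static Set<String> unanalyzedTerms(List<String> values) {
        return new TreeSet<>(values);
    }

    public static void main(String[] args) {
        // Hypothetical sample paths.
        List<String> paths = Arrays.asList("/docs/a1.txt", "/docs/b2.txt", "/docs/c3.txt");
        System.out.println("\"path\" terms:     " + analyzedTerms(paths));
        System.out.println("\"fullpath\" terms: " + unanalyzedTerms(paths));
    }
}
```

With distinct paths, the non-analyzed field yields exactly one term per document, which is why its term count can be compared directly with the document count.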

The problem with my app was that I was creating a TermEnum:

Term term = new Term("fullpath", "");
termsEnumerator = reader.terms(term);

and then going immediately into a while loop:

while (termsEnumerator.next()) {
.
.
}

i.e., I was ignoring the 1st term in the TermEnum (reader.terms(term) already positions the
TermEnum on the 1st term, so the initial .next() bumps it to the 2nd term).

Anyway, so the code that I ended up with is:

try {
	System.out.println("Outside while: About to get 1st termsEnumerator.term()...");
	currentTerm = termsEnumerator.term();
	currentField = currentTerm.field();
	termpathcount++;
	System.out.println("Outside while: 1st Field = [" + currentField + "] Term = [" + currentTerm.text() + "]");
	System.out.println("Outside while: About to drop into while()...");
	while (termsEnumerator.next()) {
		currentTerm = termsEnumerator.term();
		currentField = currentTerm.field();
		if (currentField.equalsIgnoreCase("fullpath")) {
			termpathcount++;
			System.out.println("Count=" + termpathcount + " Field = [" + currentField + "] Term = [" + currentTerm.text() + "]");
		}
	} // end while()

	termsEnumerator.close();
	System.out.println("Matching terms count = " + termpathcount);
} catch (Exception e) {
	System.out.println("** ERROR **: Exception while stepping through index: [" + e + "]");
	e.printStackTrace();
}

and, that seems to be working perfectly.
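
For anyone who wants to see the off-by-one in isolation, here is a small sketch that runs without Lucene. FakeTermEnum is a hypothetical stand-in that only mimics the behavior described above (already positioned on the first term when created, next() moves to the following one); it is not Lucene's TermEnum, and the sample terms are made up. The do/while variant is one possible way to handle the first term inside the loop, and it also stops once the field changes, assuming terms are ordered by field and then by text.

```java
import java.util.Arrays;
import java.util.List;

public class TermEnumLoopSketch {

    // Hypothetical stand-in for a term enumerator; NOT Lucene's TermEnum.
    // Each term is a two-element array: {field, text}.
    static final class FakeTermEnum {
        private final List<String[]> terms;
        private int pos = 0; // already positioned on the first term

        FakeTermEnum(List<String[]> terms) {
            this.terms = terms;
        }

        // Current term, or null when the enumeration is exhausted.
        String[] term() {
            return pos < terms.size() ? terms.get(pos) : null;
        }

        // Advance to the next term; true if one is available afterwards.
        boolean next() {
            pos++;
            return pos < terms.size();
        }
    }

    // The original pattern: calling next() before term() skips the first term.
    static int countSkippingFirst(FakeTermEnum te, String field) {
        int count = 0;
        while (te.next()) {
            if (te.term()[0].equals(field)) {
                count++;
            }
        }
        return count;
    }

    // do/while pattern: look at the current term first, then advance.
    static int countAll(FakeTermEnum te, String field) {
        int count = 0;
        do {
            String[] t = te.term();
            if (t == null || !t[0].equals(field)) {
                break; // no terms at all, or past the last term of this field
            }
            count++;
        } while (te.next());
        return count;
    }

    // Made-up sample terms, ordered by field and then by text.
    static List<String[]> sampleTerms() {
        return Arrays.asList(
                new String[] {"fullpath", "/a.txt"},
                new String[] {"fullpath", "/b.txt"},
                new String[] {"fullpath", "/c.txt"},
                new String[] {"path", "txt"});
    }

    public static void main(String[] args) {
        System.out.println("next() first: " + countSkippingFirst(new FakeTermEnum(sampleTerms()), "fullpath"));
        System.out.println("do/while:     " + countAll(new FakeTermEnum(sampleTerms()), "fullpath"));
    }
}
```

The null and field checks in countAll also cover the cases where the index has no "fullpath" terms at all, which the version with the first term handled outside the loop does not guard against.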

Also, thanks for following up re. that Luke problem.  That was one piece of this "puzzle"
that was kind of driving me batty :)!!

Jim



---- Phil Whelan <phil123@gmail.com> wrote: 
> Hi Jim,
> 
> On Sun, Aug 2, 2009 at 1:32 AM, <ohaya@cox.net> wrote:
> > I first noticed the problem that I'm seeing while working on this latter app.  Basically,
> > what I noticed was that while I was adding 13 documents to the index, when I listed the "path"
> > terms, there were only 12 of them.
> 
> Field text (the whole "path" in your case) and terms (the tokens of
> the field text) are different.
> 
> The StandardAnalyzer breaks up words like this...
> Field text = "/a/b/c.txt"
> Tokens = {"a","b","c","txt"}
> 
> So this 1 field of 1 document becomes 4 terms / tokens (not sure if
> there is a difference in this terminology between "terms" and "tokens"
> sorry).
> Therefore, you're going to have more terms than documents initially,
> but as the overlap in term usage increases this changes.
> 
> For instance, these 3 paths
> "/a/b/c/d.txt","/b/c/d/a.txt","/c/d/a/b.txt" are still only a total of
> 4 terms, since they share the same terms.
> 
> In fact, StandardAnalyzer goes a bit further than that and removes
> "stop-words", such as "a" (or "an", "the") as it's designed for
> general text searching.
> 
> That said, I think you have a point with the next part of your question...
> 
> > So then, I reviewed the index using Luke, and what I saw with that was that there
> > were indeed only 12 "path" terms (under "Term Count" on the left), but, when I clicked the
> > "Show Top Terms" in Luke, there were 13 terms listed by Luke.
> 
> Yes, I just checked this and this seems to be a bug with Luke. It
> always shows 1 less in "Term Count" than it should. Well spotted.
> 
> Cheers,
> Phil
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
