lucene-java-user mailing list archives

From Phil Whelan <phil...@gmail.com>
Subject Re: Weird discrepancy with term counts vs. terms (off by 1)
Date Sun, 02 Aug 2009 16:08:38 GMT
Hi Jim,

On Sun, Aug 2, 2009 at 1:32 AM, <ohaya@cox.net> wrote:
> I first noticed the problem that I'm seeing while working on this latter app. Basically,
> what I noticed was that while I was adding 13 documents to the index, when I listed the "path"
> terms, there were only 12 of them.

Field text (the whole "path" in your case) and terms (the tokens of
the field text) are different.

The StandardAnalyzer breaks up words like this...
Field text = "/a/b/c.txt"
Tokens = {"a","b","c","txt"}

So this 1 field of 1 document becomes 4 tokens. (Strictly speaking,
tokens are the pieces the analyzer emits, and terms are the distinct
tokens stored in the index for a field.)
Therefore, you're going to have more terms than documents initially,
but as more documents reuse the same terms, the term count grows more
slowly than the document count.

For instance, these 3 paths
"/a/b/c/d.txt","/b/c/d/a.txt","/c/d/a/b.txt" are still only a total of
5 terms ("a","b","c","d","txt"), since they share the same terms.

In fact, StandardAnalyzer goes a bit further than that and removes
"stop-words", such as "a" (or "an", "the") as it's designed for
general text searching.
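
To make the counting concrete, here is a rough sketch in plain Java. It does not use Lucene at all: tokenize() is a simplified stand-in that just lowercases and splits on non-alphanumeric characters, and STOP_WORDS holds only the three words mentioned above, not StandardAnalyzer's real default list. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

public class TermCountSketch {
    // Assumption: only the stop words named in this message, not Lucene's full default set.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "the");

    // Simplified stand-in for an analyzer: lowercase, then split on non-alphanumerics.
    static List<String> tokenize(String fieldText) {
        List<String> tokens = new ArrayList<>();
        for (String piece : fieldText.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!piece.isEmpty()) { // skip the empty piece before a leading "/"
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Distinct terms across all field values, optionally dropping stop words.
    static Set<String> distinctTerms(List<String> fieldTexts, boolean removeStopWords) {
        Set<String> terms = new TreeSet<>();
        for (String text : fieldTexts) {
            for (String token : tokenize(text)) {
                if (!removeStopWords || !STOP_WORDS.contains(token)) {
                    terms.add(token);
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        List<String> paths = List.of("/a/b/c/d.txt", "/b/c/d/a.txt", "/c/d/a/b.txt");
        System.out.println(tokenize("/a/b/c.txt"));      // tokens of one field value
        System.out.println(distinctTerms(paths, false)); // distinct terms, no stop-word removal
        System.out.println(distinctTerms(paths, true));  // distinct terms, stop words dropped
    }
}
```

Under these assumptions, the three paths give the distinct terms a, b, c, d and txt with plain splitting, and b, c, d and txt once "a" is dropped as a stop word. The real StandardAnalyzer may tokenize file names differently, so treat this only as an illustration of why the term count and the document count need not match.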

That said, I think you have a point with the next part of your question...

> So then, I reviewed the index using Luke, and what I saw with that was that there were
> indeed only 12 "path" terms (under "Term Count" on the left), but, when I clicked the "Show
> Top Terms" in Luke, there were 13 terms listed by Luke.

Yes, I just checked this and it seems to be a bug in Luke. The
"Term Count" it shows is always 1 less than it should be. Well spotted.

Cheers,
Phil
