lucene-java-user mailing list archives

From "Erick Erickson" <>
Subject Re: Counting term frequency without using Explanation
Date Wed, 07 Feb 2007 14:48:42 GMT
Before you go too far down this path, please consider what a "hit" is. It's
more complicated than you think <G>.

If all you want to do is count up the number of times any term appears in
the document, it's not too hard. You should be able to use a
TermEnum/TermDocs process to count them.

TermDocs should work: seek() to a term, skipTo() the document number
(which you'll have to get somewhere else), and, if doc() then matches your
target, take the count from freq(). Repeat for each term.
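
The counting step itself can be illustrated without Lucene at all. What follows is only a sketch in plain Java, working on an in-memory token list rather than an index; the class name QueryTermCounter and its count method are made up for this example and are not part of Lucene. It counts only the terms that occur in the query, lower-casing both sides on the assumption that the analyzer would do the same.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene: counts how often each query term
// occurs in one document's already-tokenized text.
final class QueryTermCounter {

    static Map<String, Integer> count(List<String> docTokens, List<String> queryTerms) {
        // LinkedHashMap keeps the query's term order in the result.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String term : queryTerms) {
            counts.put(term.toLowerCase(Locale.ROOT), 0);
        }
        for (String token : docTokens) {
            String key = token.toLowerCase(Locale.ROOT);
            // Tokens that are not query terms are ignored.
            if (counts.containsKey(key)) {
                counts.put(key, counts.get(key) + 1);
            }
        }
        return counts;
    }
}
```

Calling count once per matching document, with that document's tokens and the query's terms, gives one small map per document. Against a real index the per-term count would come from the index instead of a token scan.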

But it's a much more complicated story if you want to accurately reflect a
query. Take a near query, that is, terms within, say, 3 positions of
each other. If you do something like the above, you'll present "hits" that
aren't real. For instance...

a b c d e f g h i j a

If you search for a and c within 3 of each other, is this one hit? Two? It
definitely isn't three, which is what you'd get if you just counted the
occurrences of the terms a and c. What about a NOT clause? How does a phrase
query get counted?
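
To make the difference concrete, here is a small plain-Java sketch (again not Lucene code; the class ProximityCount and both method names are invented for illustration). It assumes "within 3" means the two token positions differ by at most 3, which is only one possible definition and not necessarily how Lucene's own proximity queries treat a match. It also assumes the two terms are distinct.

```java
import java.util.List;

// Hypothetical illustration, not part of Lucene: compares a raw occurrence
// count with a count of (first, second) pairs that are close together.
final class ProximityCount {

    // Total occurrences of either term, ignoring position entirely.
    static int rawCount(List<String> tokens, String first, String second) {
        int n = 0;
        for (String t : tokens) {
            if (t.equals(first) || t.equals(second)) {
                n++;
            }
        }
        return n;
    }

    // Number of (first, second) position pairs at most maxDistance apart.
    static int pairsWithin(List<String> tokens, String first, String second, int maxDistance) {
        int pairs = 0;
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) {
                continue;
            }
            for (int j = 0; j < tokens.size(); j++) {
                if (tokens.get(j).equals(second) && Math.abs(i - j) <= maxDistance) {
                    pairs++;
                }
            }
        }
        return pairs;
    }
}
```

For the token sequence above, rawCount with a and c gives 3 (two a's and one c), while pairsWithin with a distance of 3 gives 1, because only the first a is close enough to the c. Under a different definition of a "hit" the second number could just as well come out differently, which is the point.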

There have been several discussions of various aspects of this issue, but
often in the context of highlighting. You'll probably get some good
information from the following threads...

Counting terms' hits from phrases
Counting hits in a document

as well as from searching the archive on highlighting and/or hitcount.


On 2/7/07, csahat <> wrote:
> Hi all,
>   I'm sorry if this question has already been answered on this list, but I
> searched the list and couldn't find the answer.
>   This is what I want to do:
>   When the user types in a query, for example "WebSphere Java", Lucene
> should show not only the score but also the term count per document,
> like this:
>   doc1    0.8333          websphere=3, Java=2
>   doc2    0.817           websphere=2, Java=2
>   I already tried to implement this with TermFreqVector, but TermFreqVector
> shows all the terms in the field, whereas I want only the terms that occur
> in the query.
>   I also tried using TermDocs, but it always gave a result of 0.
>   I tried using the Explanation class and its toString method, but then I
> have to "clean" the information.
>   Is there any "direct" way to do this in Lucene? Or perhaps someone can
> give me a hint?
>   Thanks in advance
