lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: Counting term frequency without using Explanation
Date Mon, 19 Feb 2007 16:09:25 GMT
Just search the mail archives for the threads mentioned... start at
...http://www.gossamer-threads.com/lists/lucene/java-user/


But if all you want to do is count terms, see the termdocs/termenum classes
in the Lucene api...

But I believe that term frequency is *already* part of the relevance
calculation provided out of the box. Are you sure that the default relevancy
calculations aren't already doing this for you?

Best
Erick

On 2/19/07, Harini Raghavan <harini.raghavan@insideview.com> wrote:
>
> Hi Erick,
>
> I have a similar requirement to know the frequency of occurrence of a
> keyword in a given content to find out the relevancy of the article to a
> set
> of keywords. If the keyword is mentioned more than once in the article,
> then
> I want to treat it as more relevant.
>
> Can you please point me to the threads you have mentioned?
>
> Thanks,
> Harini
>
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: Wednesday, February 07, 2007 8:19 PM
> To: java-user@lucene.apache.org
> Subject: Re: Counting term frequency without using Explanation
>
> Before you go too far down this path, please consider what a "hit" is.
> It's
> more complicated than you think <G>.
>
> If all you want to do is count up the number of times any term appears in
> the document, it's not too hard. You should be able to use a
> termenum/termdocs process to count them.
>
> TermDocs should work, just seek to a term, skip to the document number
> (which you'll have to get somewhere else), and keep adding to your count
> while the docid is the same as your target. Repeat for each term.
>
> But it's a much more complicated story if you want to accurately reflect a
> query. For instance, consider a near query, that is terms within, say, 3
> of
> each other. If you do something like the above, you'll present "hits" that
> aren't real. For instance...
>
> a b c d e f g h i j a
>
> if you search for a and c within 3 of each other, is this one hit? two? it
> definitely isn't three which is what you'd get if you just counted the
> occurrence of the terms a, b... What about a NOT clause? How does a phrase
> query get counted?
>
> There have been several discussions of various aspects of this issue, but
> often in the context of highlighting. You'll probably get some good
> information from the following threads...
>
> Counting terms' hits from phrases
> Counting hits in a document
>
> as well as searching the archive on highlighting and/or hitcount
>
> Best
> Erick
>
>
>
>
> On 2/7/07, csahat <csahat@gmail.com> wrote:
> >
> > Hi all,
> >
> >   I'm so sorry if this question already answered before in this list,
> > but I already search the list, and I couldn't find the answer.
> >
> >    This is what I want to do :
> >
> >   When the user type in the query, for example "WebSphere Java",
> > Lucene will show not only the score, but showing the term count per
> > document as well, like this
> >
> >   doc1    0.8333          websphere=3, Java = 2
> >   doc2    0.817            websphere=2, Java=2
> >
> >
> >   I already tried to implement with TermFreqVector, but TermFreqVector
> > show all the terms in the field, instead what I want is only the terms
> > that happen in the query.
> > I already tried using TermDocs as well, but it always gave result 0.
> >
> >   I tried using Explanation class, using toString method, but I have
> > to "clean"
> > the information.
> >
> >
> >   Is there any "direct" way to do this in Lucene ?  Or perhaps someone
> > can give me a hint ?
> >
> >   Thanks in advance
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message