lucene-java-user mailing list archives

From Paul Elschot
Subject Re: Phrase Frequency For Analysis
Date Thu, 22 Jun 2006 07:01:11 GMT
On Thursday 22 June 2006 01:33, Nader Akhnoukh wrote:
> Hi, I've looked through the archives and it looks like this question has
> been asked in one form or another a few times, but without a satisfactory
> solution.
> I am trying to get the most frequently occurring phrases in a document and
> in the index as a whole.  The goal is to compare the two to get something like
> Amazon's SIPs.
> This is straightforward for individual words.  Get the term frequency of
> each term in a doc and compare it to the frequency of that term in the
> index.  A high ratio indicates that the term appears in this doc much more
> than the other docs on average.
> Does anyone have an idea of how to do this with phrases of, say, 1 to 3 words?
> Just to be clear, in this case I am only using Lucene for its built-in
> frequency analysis; I'm not actually using it to search for anything that is
> indexed.

Other than indexing the phrases directly, you could build a SpanNearQuery
over the words, call getSpans() on it (this gives the same Spans that its
SpanScorer iterates over), and count the number of times next() on this
Spans returns true.
This gives the total number of occurrences of the "phrase" in the index.
Each time doc() on the Spans returns a given document number,
one can increment the phrase frequency count for that document.
A Spans always iterates in nondecreasing document number order, so all
matches for one document arrive consecutively.
Btw, that is a search.
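
A minimal sketch of the counting loop described above, in plain Java with no
Lucene dependency. SimpleSpans and fromDocs() are hypothetical stand-ins
introduced only for this illustration: they mimic the next()/doc() shape
mentioned above and are not the Lucene classes. With a real index, the
iterator would instead come from the span query, and the exact signatures
should be checked against the Lucene version in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PhraseFrequencySketch {

    /** Hypothetical stand-in for a span iterator; NOT the Lucene Spans class. */
    public interface SimpleSpans {
        boolean next(); // advance to the next match; false when exhausted
        int doc();      // document number of the current match
    }

    /** Hypothetical helper: replays a fixed, nondecreasing list of document numbers. */
    public static SimpleSpans fromDocs(final int... docs) {
        return new SimpleSpans() {
            private int pos = -1;
            public boolean next() { return ++pos < docs.length; }
            public int doc() { return docs[pos]; }
        };
    }

    /** Counts matches per document by walking the iterator once. */
    public static Map<Integer, Integer> countPerDoc(SimpleSpans spans) {
        Map<Integer, Integer> perDoc = new LinkedHashMap<>();
        while (spans.next()) {
            // Every successful next() is one occurrence of the "phrase".
            perDoc.merge(spans.doc(), 1, Integer::sum);
        }
        return perDoc;
    }

    /** Index-wide phrase count: the sum of the per-document counts. */
    public static int total(Map<Integer, Integer> perDoc) {
        int sum = 0;
        for (int count : perDoc.values()) {
            sum += count;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Six matches spread over documents 0, 2 and 5.
        Map<Integer, Integer> perDoc = countPerDoc(fromDocs(0, 0, 2, 5, 5, 5));
        System.out.println(perDoc + " total=" + total(perDoc));
        // prints: {0=2, 2=1, 5=3} total=6
    }
}
```

Because document numbers never decrease, a single running counter that is
reset whenever doc() changes would work as well; the map is used here only
to keep the sketch short. The per-document count divided by the total can
then be compared across phrases in the same way as the single-term ratio in
the original question.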

Paul Elschot
