lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Nader Akhnoukh" <>
Subject Phrase Frequency For Analysis
Date Wed, 21 Jun 2006 23:33:30 GMT
Hi, I've looked through the archives and it looks like this question has
been asked in one form or another a few times, but without a satisfactory

I am trying to get the most frequently occurring phrases in a document and
in the index as a whole.  The goal is compare the two to get something like
Amazon's SIPs.

This is straightforward for individual words.  Get the term frequency of
each term in a doc and compare it to the frequency of that term in the
index.  A high ratio indicates that the term appears in this doc much more
than the other docs on average.

Does anyone have an idea of how to do this with phrases of say 1 to 3 words?

Just to be clear,  in this case I am only using Lucene for it's built in
frequency analysis, I'm not actually using it to search for anything that is


  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message