lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Andrzej Bialecki>
Subject Re: Phrase Frequency For Analysis
Date Thu, 22 Jun 2006 18:27:39 GMT
Nader Akhnoukh wrote:
> Yes, Chris is correct, the goal is to determine the most frequently 
> occuring
> phrases in a document compared to the frequency of that phrase in the
> index.  So there are only output phrases, no inputs.
> Also performance is not really an issue, this would take place on an
> irregular basis and could run overnight if need be.
> So it sounds like the best approach would be to index all 1, 2, and 3 
> word
> phrases.  Does anyone know of an Analyzer that does this?  And if I can
> successfully index the phrases would the term frequency vector contain 
> all
> the combination of phrases as terms along with their frequencies?

It's straightforward to implement one that keeps the last three tokens 
in a fifo buffer, and then outputs the whole buffer as a phrase.

You will need to store these phrases as terms, i.e. the stuff you return 
in your TokenStream should be this:

term1: token1 token2
term2: token1 token2 token3
term3: token2 token3
term4: token2 token3 token4

> Andrzej,  can you discuss your approach in a little more detail.  Are you
> suggesting manually traversing each document and doing a search on each
> phrase?  That seems very intensive as I have tens of thousands of 
> documents.

Ah, if you can afford the reindexing, then indexing word n-grams would 
be indeed a better option. I had to deal with existing indexes, so I 
resorted to a "dirty hack" way. It's not based on searching, but rather 
on traversing the TermPositions, using IndexReader.termPositions(term). 
It is very intensive, but for tens of thousands docs it is still 
feasible within a couple hours time - which may be still quicker than 
reindexing ...

Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message