lucene-java-user mailing list archives

From Kamal Abou Mikhael <ka...@aub.edu.lb>
Subject Re: Phrase Frequency For Analysis
Date Thu, 22 Jun 2006 17:55:35 GMT
I may be coming into this thread without knowing enough.  I have implemented a
phrase filter, which indexes all token sequences that are 2 to N tokens long.
N is set in the constructor.

It takes a stopword Trie as input because the policy I used, based on a published
work I read, was that a phrase should neither begin nor end with a stopword.
The policy is optional: to disable it, simply provide an empty Trie.
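
As a rough illustration of the policy above (this is not the attached
SWPhraseFilter.java, and it does not use the Lucene TokenFilter API), the
sketch below emits every run of 2 to N consecutive tokens that neither
begins nor ends with a stopword.  The class name PhraseSketch, the method
phrases(), and the use of a plain Set<String> in place of the Trie are all
assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the phrase filter; a Set replaces the stopword Trie.
class PhraseSketch {

    /** Returns all runs of 2..maxLen tokens that neither start nor end with a stopword. */
    static List<String> phrases(List<String> tokens, Set<String> stopwords, int maxLen) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            // A phrase may not begin with a stopword.
            if (stopwords.contains(tokens.get(start))) {
                continue;
            }
            StringBuilder sb = new StringBuilder(tokens.get(start));
            for (int len = 2; len <= maxLen && start + len <= tokens.size(); len++) {
                String last = tokens.get(start + len - 1);
                sb.append(' ').append(last);
                // A phrase may not end with a stopword, but may contain one inside.
                if (!stopwords.contains(last)) {
                    out.add(sb.toString());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("state", "of", "the", "art", "search");
        Set<String> stop = new HashSet<>(Arrays.asList("of", "the"));
        // prints [state of the art, art search]
        System.out.println(phrases(tokens, stop, 4));
    }
}
```

Passing an empty stopword set yields every 2 to N token sequence, which
corresponds to providing an empty Trie.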

The Trie structure is found at http://www.graphbuilder.com/trie/
The SWPhraseFilter.java file is attached and is self-explanatory.

I do believe that if you change the condition on line 67 from
phraseTerms.size() > 1 to phraseTerms.size() > 0, you may get the result you
need.  Sorry, I do not have the time to test this out.

I have also included TriePhraseFilter.java.  After I indexed all the possible
phrases, I went back and dumped the ones that occurred in more than 25 docs into
a file (this turned out to be about 2% of all candidates), then I placed these
in a Trie and reindexed the documents.

So I used this TriePhraseFilter to recognize the phrases I dumped into the file.
The requirement that a phrase occur in more than 25 documents also comes from
the paper that I read about phrase indexing.
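
The selection step described above can be sketched roughly as follows.  This
is only an illustration in plain Java, not the code I ran: the class name
PhraseDocFreqSketch, the method frequentPhrases(), and the in-memory list of
per-document phrases are assumptions made for the example.  In practice the
document frequencies would come from the index itself.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: keeps phrases that occur in more than `threshold` documents.
class PhraseDocFreqSketch {

    static Set<String> frequentPhrases(List<List<String>> phrasesPerDoc, int threshold) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> docPhrases : phrasesPerDoc) {
            // De-duplicate within a document so each document counts at most once.
            for (String phrase : new HashSet<>(docPhrases)) {
                docFreq.merge(phrase, 1, Integer::sum);
            }
        }
        Set<String> keep = new TreeSet<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            if (e.getValue() > threshold) {
                keep.add(e.getKey());
            }
        }
        return keep;
    }
}
```

Calling frequentPhrases(phrasesPerDoc, 25) would give the set of phrases to
load into the Trie before reindexing.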

By the way, I am about 90 percent certain that the TriePhraseFilter code is
correct.

Best Regards,
Kamal Abou Mikhael


Quoting Nader Akhnoukh <iamnader@gmail.com>:

> Yes, Chris is correct, the goal is to determine the most frequently occurring
> phrases in a document compared to the frequency of that phrase in the
> index.  So there are only output phrases, no inputs.
>
> Also performance is not really an issue, this would take place on an
> irregular basis and could run overnight if need be.
>
> So it sounds like the best approach would be to index all 1, 2, and 3 word
> phrases.  Does anyone know of an Analyzer that does this?  And if I can
> successfully index the phrases would the term frequency vector contain all
> the combination of phrases as terms along with their frequencies?
>
> Andrzej, can you discuss your approach in a little more detail?  Are you
> suggesting manually traversing each document and doing a search on each
> phrase?  That seems very intensive as I have tens of thousands of documents.
>
> Thanks.
>

