lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Bob Carpenter <c...@alias-i.com>
Subject Re: Phrase Frequency For Analysis
Date Thu, 22 Jun 2006 19:59:32 GMT
Adding to this growing thread, there's really no reason to
index all the term bigrams, trigrams, etc.  It's not
only slow, it's very memory/disk intensive.  All you need
to do is two passes over the collection.

Pass One
--------
Collect counts of bigrams (or trigrams, or whatever -- if
size is an issue, bigrams can be done in one pass, trigrams
in the next [perhaps restricted to containing a significant
bigram]).

And this doesn't always need to be totally up to date,
depending on how quickly you want to spot new terms
and more importantly, how quickly you want to expire them.

The first pass doesn't need to look at everything.  If you have
a huge collection, just choose a representative sample
(probably just a set of randomly chosen docs unless there's
more structure to exploit).

You may even be able to skip pass one altogether
if you're going for both new terms and significant
phrases relative to the collection -- see my last
post on Lucene and SIPs.  For that, you can just use
the index.

Pass Two
--------
For each document, collect its bigrams, trigrams, etc.
in a counter and do the significance tests vs. the larger
counts.  At this stage, you might want to index the significant
terms in the doc so they're easy to pull out at search time
if that's when you want to return them.

First thing to Try
------------------
Ignore pass one, calculate significance ranking
per term per doc by:

Z("t1 t2",d) = docCount("t1 t2",d) - docSize(d) * collectionProb("t1 t2")
                ---------------------------------------------
                sqrt(docSize(d) * collectionProb("t1 t2") * (1 - collectionProb("t1 t2")))

where:

docCount("t1 t2",d) is the count of "t1 t2" in doc d
docSize(d) is the number of bigrams in d
           (though the number of terms is probably
            a close enough approximation)
collectionProb("t1 t2") = collectionProb("t1") * collectionProb("t2")
collectionProb("t") = collectionCount("t") / collectionSize
collectionCount("t") = count of term "t" in the collection
collectionSize = number of term instances (not types) in the collection

- Bob Carpenter
   Alias-i

Andrzej Bialecki wrote:
> Nader Akhnoukh wrote:
> 
>> Yes, Chris is correct, the goal is to determine the most frequently 
>> occuring
>> phrases in a document compared to the frequency of that phrase in the
>> index.  So there are only output phrases, no inputs.
>>
>> Also performance is not really an issue, this would take place on an
>> irregular basis and could run overnight if need be.
>>
>> So it sounds like the best approach would be to index all 1, 2, and 3 
>> word
>> phrases.  Does anyone know of an Analyzer that does this?  And if I can
>> successfully index the phrases would the term frequency vector contain 
>> all
>> the combination of phrases as terms along with their frequencies?
> 
> 
> It's straightforward to implement one that keeps the last three tokens 
> in a fifo buffer, and then outputs the whole buffer as a phrase.
> 
> You will need to store these phrases as terms, i.e. the stuff you return 
> in your TokenStream should be this:
> 
> term1: token1 token2
> term2: token1 token2 token3
> term3: token2 token3
> term4: token2 token3 token4
> ...
> 
>>
>> Andrzej,  can you discuss your approach in a little more detail.  Are you
>> suggesting manually traversing each document and doing a search on each
>> phrase?  That seems very intensive as I have tens of thousands of 
>> documents.
> 
> 
> Ah, if you can afford the reindexing, then indexing word n-grams would 
> be indeed a better option. I had to deal with existing indexes, so I 
> resorted to a "dirty hack" way. It's not based on searching, but rather 
> on traversing the TermPositions, using IndexReader.termPositions(term). 
> It is very intensive, but for tens of thousands docs it is still 
> feasible within a couple hours time - which may be still quicker than 
> reindexing ...

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message