lucene-java-user mailing list archives

From Doron Cohen <>
Subject Re: Help regarding an Algo.
Date Fri, 13 Apr 2007 16:25:42 GMT
"sai hariharan" <> wrote on 13/04/2007 01:50:35:

> Hi to all,
> I have an algorithm, given below. Can anybody help me implement it?
> Any sort of suggestion will be appreciated. I've finished removing stop
> words and calculating term frequencies with Lucene. The rest is not
> clear.
> I'm working only on the English part.
> ....
> ....
>    - If the frequency of a word is less than two, it is discarded.
>      Next, the association rule is used to compute the confidence
>      value as follows.
>
> Confidence(A ⇒ B) = P(A∩B) / P(A)
>
> If P(A) is 1 and the co-occurrence, P(A∩B), is also 1, then the resulting
> confidence value must be 1.
> Obviously, it must be greater than the threshold. Clearly, P(A) cannot be
> 0, as this will result in a division-by-zero [13].

This seems to be part of "Using Association Rules for Expanding Search Engine
Recommendation Keywords in English and Chinese Queries (2005)" by Y.P.
Huang, C.-A. Tsai (Taiwan), and F.E. Sandnes (Norway), and it would be
hard to help without reading the paper (which is not freely available...)

Anyhow, from the abstract it seems this step attempts to infer whether two
consecutive words A B in a document should be treated as a phrase, based
on the probability that B appears right after A in (I think) the entire
collection, or more likely in some training collection.

So for P(A) one could use the total occurrence count of A = sum of termFreq(A)
over all documents containing A (again, in the training collection). You can
divide by a collection size factor (word count) to make this a [0..1]
probability. For P(A∩B) one could similarly use the total number of
occurrences of B right after A in the entire training collection. Both
values should probably be computed in advance, in a preprocessing step, for
all the candidate words: actually all the pairs of non-stop words A B
(appearing at least twice) in the entire training collection.
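
A minimal sketch of that preprocessing step, in plain Python rather than against the Lucene API. Everything here is an assumption and not taken from the paper: the function name pair_confidences, the min_count parameter, the input format (one list of tokens per document, stop words already removed), and the choice to apply the "at least twice" filter to each word rather than to the pair. Dividing both counts by the same collection size cancels out, so the confidence reduces to count(A B) / count(A).

```python
from collections import Counter


def pair_confidences(docs, min_count=2):
    """Estimate Confidence(A => B) = P(A∩B) / P(A) for adjacent word pairs.

    docs: list of token lists, one per document (hypothetical input format,
    assumed to have stop words removed already).
    min_count: assumed reading of the "frequency less than two" rule;
    pairs involving a word seen fewer times than this are skipped.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in docs:
        word_counts.update(tokens)
        # Pairs (A, B) where B comes right after A in the token list.
        # Note: after stop-word removal this also pairs words that had
        # a stop word between them in the original text.
        pair_counts.update(zip(tokens, tokens[1:]))

    total_words = sum(word_counts.values())  # collection size factor

    confidences = {}
    for (a, b), n_ab in pair_counts.items():
        if word_counts[a] < min_count or word_counts[b] < min_count:
            continue
        p_a = word_counts[a] / total_words   # never 0: A occurs in a pair
        p_ab = n_ab / total_words
        confidences[(a, b)] = p_ab / p_a
    return confidences


if __name__ == "__main__":
    training_docs = [
        ["new", "york", "city"],
        ["new", "york", "times"],
        ["new", "jersey", "city"],
    ]
    for pair, conf in sorted(pair_confidences(training_docs).items()):
        print(pair, round(conf, 3))
```

Pairs whose confidence exceeds some threshold would then be the phrase candidates. With a real index the two counters would be filled from term frequencies and term positions instead of in-memory token lists.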

I *guess* the probabilities are later used to expand user queries, either
automatically for improved recall or just to suggest an expanded query via
the UI, but this is mostly *guessing* as I don't know that paper... Wouldn't
it be best to contact the authors?