lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Spencer, Dave" <>
Subject RE: Does it make sense to add Bayesian search to lucent?
Date Thu, 19 Sep 2002 20:18:05 GMT
Thanks for info, sounds interesting.
Yes, I'd like to see what you've done.
I'll try to post my 2 Bayesian "test" programs + my indexer
[the later
in case people want a 1 GB index to play with]. Will prob take a day or
to get this stuff together.

-----Original Message-----
From: Maurits van Wijland []
Sent: Thursday, September 19, 2002 12:59 PM
To: Lucene Developers List
Cc: Doug Cutting; Spencer, Dave
Subject: Re: Does it make sense to add Bayesian search to lucent?


this is very interesting. I have done something similair with Mutual
Information statistics.
The trick is to build a semantic network. First, you go through all the
terms and perform
a NEAR search within a preset window. Next you calculate the
and word
pairs with a higher than 0,6 probability are added to the semantic

This network can then be referenced when performing a search. Simply
term in the semantic network, and get all related words. Next add them
the query
and set the boost factor properly (for instance, the entered term boost
times 2, and the others
can be boosted related to the probability of the ord pair).

There is a believe that for documents within a fixed domain need to
a semantic network
only when there is a substantial addition of documents to the

Maybe we can swap some code, or in case anyone else is interested, I can
submit my
Mutual Information code to the list? But it is based on my Dutch
and dutch stopwords.

Doug, you have much experience in this area of statistics and
clustering, I
recall some documents
on Gather/Scatter. Is Lucene going this way or will Lucene remain a Free
Text search engine?
What are the plans for the near future of Lucene?


maurits van wijland

----- Original Message -----
From: "Spencer, Dave" <>
To: <>
Sent: Thursday, September 19, 2002 8:54 PM
Subject: Does it make sense to add Bayesian search to lucent?

I'm not convinced I understand Bayesian search/logic/probabilities.
I thought this was a great article about the company Autonomy in Wired
and it gives a kind of technical intro to it and surprisingly for such
magazines, there's even a math equation it in(!).

It gives what I believe is a simplified version of Baye's theorem:

P(t|y) = [ P(y|t)P(t) ] / P(y)

P(t|y) is the probability that 't' implies 'y'.
P(t) is the probability of 't' happening.

In the context of a search engine I think 't' and 'y' are
two different words - the equation says that if you're searching
for 't', the probability that 'y' is relevant is, well, the right side
of the equation.

To calculate the right side you do the following:

P(t) is the prob of 't' occurring in a document, thus searching with the
query "+t" and
calling Hits.length() returns this (well, you divide the the # of docs
in the collection
IndexReader.numDocs() to get the probability).

P(y) is done the same way.

P(y|t), I believe, is done as follows. Is it the prob of both words
divided by the number of docs that y occurs, so it's the count for "+y
+t" divided
by the count for "+y".

Then voilla, you can calculate it out.

I believe then that the way Lucene can be used to support this kind of
logic is
once the index is built go thru all terms (IndexReader.terms()) and
P(t|y) for every pair of words indexed.

Then if someone queries on 't' you do something similar to what a fuzzy
search does - you
form a larger query and boost the other words based on their probability
(i.e. P(t|y)).
The result is that you might get
docs you didn't expect i.e. docs without 't' (the term you were
searching on) which is
where this becomes valuable - it can find docs w/o any of your search
terms as you search terms
in a sense imply that other words are relevant.

I have 2 test programs:
[1] Given a pair of words and an index this app computes P(t|y)
[2] Given one "starting word" (i.e. 't') this one computes P(t|y) for
all other terms

#1 runs fast enough while #2 is similar to what would be done in
production, and takes
forever on a large index (several hours on my 1GB/3Million record index
of the content).

Now for the questions - is the above generally right i.e.
(a) my descr of how to calculate P(t|y)
(b) my descr of the need to precalc P(t|y) for all pairs of words in the
index, and thus
it must take forever to do this in a large index [(b1) does Autonomy do

And the one other thing - I'm still checking my work but in some cases
P(t|y) is > 1.0, which
seems to make no sense as probabilities must be between 0 and 1 , but in
the equation of
P(t) is high and P(y) is low then maybe this "makes sense" and thus the
equation is not always in
the [0...1] range.

I'm also interested in more, not-too-technical info on Bayesian search.
Various docs I found  on
where not what I was looking for.


 I hope I hit the relevant mailing list - sorry if this is outside the
dev charter.


Here's the output of my test program when I asked it to calcluate P(t|y)
for t='linux'
going over index of dmoz (index contains title+desc). It seem to make
some sense - the
case I was interested in was y=penguin (the linux mascot) which ranks in
at 65%, thus
if you searched on linux you might be (pleasantly?) suprised to get docs
with pictures
of baby penguins or whatnot. Output is all words < 100% and > 50%. I
have not audited
this extensively, esp as to why ayuda is so high(means help in
Spanish..). Might be that
small #'s of matches gives scores too high.

         99.42% ayuda
         99.42% customers
         99.42% packet
         92.36% disk
         92.06% caen
         92.06% gopher
         92.06% kiosk
         92.06% setting
         89.79% beginner
         89.54% rexx
         89.54% soluciones
         88.68% buscador
         88.68% columns
         88.68% input
         88.68% lining
         88.68% tyneside
         87.23% anesthesia
         87.23% fortran
         85.49% alternativa
         85.49% cease
         85.49% iomega
         85.49% linus
         85.49% respect
         85.49% rh
         85.49% streams
         82.46% beehive
         82.46% mastering
         82.46% techtv
         80.81% demos
         80.81% fs
         80.78% sybase
         80.17% tk
         79.59% bizkaia
         79.59% delaney
         79.59% lilac
         79.59% millions
         78.83% todos
         76.87% avi
         76.87% centurion
         76.87% emirates
         76.87% ett
         76.87% fidonet
         76.87% libri
         76.87% taken
         76.87% zmir
         75.08% monterrey
         74.57% tipps
         74.29% clic
         74.29% forensics
         74.29% rampage
         74.29% wr
         73.80% modem
         71.83% ee
         71.83% freiburger
         71.83% mic
         71.83% poder
         71.80% plug
         71.58% apps
         70.37% joins
         69.93% pluto
         69.70% unix
         69.50% azur
         69.50% dilemma
         69.50% lis
         69.50% meal
         69.50% yhdistys
         68.33% wan
         67.27% balancing
         67.27% broadcasters
         67.27% diverse
         67.27% embrace
         67.27% finances
         67.27% processors
         67.08% drivers
         65.97% cluster
         65.30% penguin
         65.29% proxy
         65.15% faithful
         65.15% macs
         65.15% touts
         65.12% hat
         64.14% kde
         64.13% administrators
         63.85% tuxedo
         63.13% feds
         63.13% handicapped
         63.13% ids
         63.13% readies
         63.13% waikato
         61.21% improving
         61.21% inventor
         61.21% spiele
         61.21% stavanger
         60.28% powered
         59.84% tome
         59.80% checklist
         59.37% announces
         59.37% bursa
         59.37% cali
         59.37% chennai
         59.37% curve
         59.37% washtenaw
         59.16% beginners
         58.53% nutshell
         58.48% benchmark
         57.61% beos
         57.61% modules
         57.61% nella
         57.61% shut
         55.93% alford
         55.93% australien
         55.93% promoting
         55.93% tivoli
         55.54% parallel
         54.52% enthusiasts
         54.32% choosing
         54.32% pretoria
         54.32% scheduling
         54.32% schwabach
         54.32% szczecin
         54.32% unicode
         53.85% intranet
         52.99% backup
         52.78% latvian
         52.78% opengl
         52.78% processor
         52.78% vendors
         52.12% tcp
         51.72% armada
         51.30% longer
         51.30% weinheim

To unsubscribe, e-mail:   <>
For additional commands, e-mail: <>

View raw message