Mailing-List: contact lucene-dev-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Developers List" <lucene-dev@jakarta.apache.org>
content-class: urn:content-classes:message
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----_=_NextPart_001_01C2600D.F2569704"
Subject: Does it make sense to add Bayesian search to lucent?
Date: Thu, 19 Sep 2002 11:54:13 -0700
Message-ID: <728DA21B8941A843A7C496F1ACF485185AD33A@gleam.lumos.com>
Thread-Topic: Does it make sense to add Bayesian search to lucent?
Thread-Index: AcJgDfJJGNcayhQ3TRqiNaP71BsOqg==
From: "Spencer, Dave" <dave@lumos.com>
To: <lucene-dev@jakarta.apache.org>

------_=_NextPart_001_01C2600D.F2569704
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

I'm not convinced I understand Bayesian search/logic/probabilities.
I thought this was a great article about the company Autonomy in Wired
and it gives a kind of technical intro to it and surprisingly for such
magazines, there's even a math equation it in(!).
=20
http://www.wired.com/wired/archive/8.02/autonomy_pr.html
=20
It gives what I believe is a simplified version of Baye's theorem:
=20
P(t|y) =3D [ P(y|t)P(t) ] / P(y)
=20
P(t|y) is the probability that 't' implies 'y'.
P(t) is the probability of 't' happening.
=20
In the context of a search engine I think 't' and 'y' are
two different words - the equation says that if you're searching
for 't', the probability that 'y' is relevant is, well, the right side
of the equation.
=20
To calculate the right side you do the following:
=20
P(t) is the prob of 't' occurring in a document, thus searching with the
query "+t" and
calling Hits.length() returns this (well, you divide the the # of docs
in the collection
IndexReader.numDocs() to get the probability).
=20
P(y) is done the same way.
=20
P(y|t), I believe, is done as follows. Is it the prob of both words
occurring
divided by the number of docs that y occurs, so it's the count for "+y
+t" divided
by the count for "+y".
=20
Then voilla, you can calculate it out.
=20
I believe then that the way Lucene can be used to support this kind of
logic is
once the index is built go thru all terms (IndexReader.terms()) and
compute
P(t|y) for every pair of words indexed.
=20
Then if someone queries on 't' you do something similar to what a fuzzy
search does - you
form a larger query and boost the other words based on their probability
(i.e. P(t|y)).=20
The result is that you might get
docs you didn't expect i.e. docs without 't' (the term you were
searching on) which is
where this becomes valuable - it can find docs w/o any of your search
terms as you search terms
in a sense imply that other words are relevant.
=20
I have 2 test programs:
[1] Given a pair of words and an index this app computes P(t|y)
[2] Given one "starting word" (i.e. 't') this one computes P(t|y) for
all other terms
=20
#1 runs fast enough while #2 is similar to what would be done in
production, and takes
forever on a large index (several hours on my 1GB/3Million record index
of the dmoz.org content).
=20
Now for the questions - is the above generally right i.e.=20
(a) my descr of how to calculate P(t|y)
(b) my descr of the need to precalc P(t|y) for all pairs of words in the
index, and thus
it must take forever to do this in a large index [(b1) does Autonomy do
this?].
=20
And the one other thing - I'm still checking my work but in some cases
P(t|y) is > 1.0, which
seems to make no sense as probabilities must be between 0 and 1 , but in
the equation of
P(t) is high and P(y) is low then maybe this "makes sense" and thus the
equation is not always in=20
the [0...1] range.
=20
I'm also interested in more, not-too-technical info on Bayesian search.
Various docs I found  on citeseer.com
where not what I was looking for.
=20
thx,=20
 Dave
=20
PS=20
 I hope I hit the relevant mailing list - sorry if this is outside the
dev charter.
=20
=20
PPS
=20
Here's the output of my test program when I asked it to calcluate P(t|y)
for t=3D'linux'
going over index of dmoz (index contains title+desc). It seem to make
some sense - the
case I was interested in was y=3Dpenguin (the linux mascot) which ranks =
in
at 65%, thus
if you searched on linux you might be (pleasantly?) suprised to get docs
with pictures
of baby penguins or whatnot. Output is all words < 100% and > 50%. I
have not audited
this extensively, esp as to why ayuda is so high(means help in
Spanish..). Might be that
small #'s of matches gives scores too high.
=20
=20
=20
         99.42% ayuda
         99.42% customers
         99.42% packet
         92.36% disk
         92.06% caen
         92.06% gopher
         92.06% kiosk
         92.06% setting
         89.79% beginner
         89.54% rexx
         89.54% soluciones
         88.68% buscador
         88.68% columns
         88.68% input
         88.68% lining
         88.68% tyneside
         87.23% anesthesia
         87.23% fortran
         85.49% alternativa
         85.49% cease
         85.49% iomega
         85.49% linus
         85.49% respect
         85.49% rh
         85.49% streams
         82.46% beehive
         82.46% mastering
         82.46% techtv
         80.81% demos
         80.81% fs
         80.78% sybase
         80.17% tk
         79.59% bizkaia
         79.59% delaney
         79.59% lilac
         79.59% millions
         78.83% todos
         76.87% avi
         76.87% centurion
         76.87% emirates
         76.87% ett
         76.87% fidonet
         76.87% libri
         76.87% taken
         76.87% zmir
         75.08% monterrey
         74.57% tipps
         74.29% clic
         74.29% forensics
         74.29% rampage
         74.29% wr
         73.80% modem
         71.83% ee
         71.83% freiburger
         71.83% mic
         71.83% poder
         71.80% plug
         71.58% apps
         70.37% joins
         69.93% pluto
         69.70% unix
         69.50% azur
         69.50% dilemma
         69.50% lis
         69.50% meal
         69.50% yhdistys
         68.33% wan
         67.27% balancing
         67.27% broadcasters
         67.27% diverse
         67.27% embrace
         67.27% finances
         67.27% processors
         67.08% drivers
         65.97% cluster
         65.30% penguin
         65.29% proxy
         65.15% faithful
         65.15% macs
         65.15% touts
         65.12% hat
         64.14% kde
         64.13% administrators
         63.85% tuxedo
         63.13% feds
         63.13% handicapped
         63.13% ids
         63.13% readies
         63.13% waikato
         61.21% improving
         61.21% inventor
         61.21% spiele
         61.21% stavanger
         60.28% powered
         59.84% tome
         59.80% checklist
         59.37% announces
         59.37% bursa
         59.37% cali
         59.37% chennai
         59.37% curve
         59.37% washtenaw
         59.16% beginners
         58.53% nutshell
         58.48% benchmark
         57.61% beos
         57.61% modules
         57.61% nella
         57.61% shut
         55.93% alford
         55.93% australien
         55.93% promoting
         55.93% tivoli
         55.54% parallel
         54.52% enthusiasts
         54.32% choosing
         54.32% pretoria
         54.32% scheduling
         54.32% schwabach
         54.32% szczecin
         54.32% unicode
         53.85% intranet
         52.99% backup
         52.78% latvian
         52.78% opengl
         52.78% processor
         52.78% vendors
         52.12% tcp
         51.72% armada
         51.30% longer
         51.30% weinheim

=20
=20

------_=_NextPart_001_01C2600D.F2569704--