# lucene-dev mailing list archives

##### Site index · List index
Message view
Top
From Jack Park <jackp...@thinkalong.com>
Subject Re: Does it make sense to add Bayesian search to lucent?
Date Thu, 19 Sep 2002 19:17:06 GMT
```My response to your question "did I reach the right list?" is: Yes. I've
mentioned Latent Semantic Analysis on this list before. Bayes, I think, is
just as valid and important. I am personally eager to see something like
Lucene serve a variety of purposes in knowledge space.

FWIW, in order to avoid the one-size-fits-all bloated application that
Lucene could easily become, I am wondering if there is a way to perform a
major refactoring of the Lucene concept into a society of agents, all
coordinated by, say, tuplespace.

Just a few EUROs...
Jack

At 11:54 AM 9/19/2002 -0700, you wrote:
>I'm not convinced I understand Bayesian search/logic/probabilities.
>I thought this was a great article about the company Autonomy in Wired
>and it gives a kind of technical intro to it and surprisingly for such
>magazines, there's even a math equation it in(!).
>
>http://www.wired.com/wired/archive/8.02/autonomy_pr.html
>
>It gives what I believe is a simplified version of Baye's theorem:
>
>P(t|y) = [ P(y|t)P(t) ] / P(y)
>
>P(t|y) is the probability that 't' implies 'y'.
>P(t) is the probability of 't' happening.
>
>In the context of a search engine I think 't' and 'y' are
>two different words - the equation says that if you're searching
>for 't', the probability that 'y' is relevant is, well, the right side
>of the equation.
>
>To calculate the right side you do the following:
>
>P(t) is the prob of 't' occurring in a document, thus searching with the
>query "+t" and
>calling Hits.length() returns this (well, you divide the the # of docs
>in the collection
>
>P(y) is done the same way.
>
>P(y|t), I believe, is done as follows. Is it the prob of both words
>occurring
>divided by the number of docs that y occurs, so it's the count for "+y
>+t" divided
>by the count for "+y".
>
>Then voilla, you can calculate it out.
>
>I believe then that the way Lucene can be used to support this kind of
>logic is
>once the index is built go thru all terms (IndexReader.terms()) and
>compute
>P(t|y) for every pair of words indexed.
>
>Then if someone queries on 't' you do something similar to what a fuzzy
>search does - you
>form a larger query and boost the other words based on their probability
>(i.e. P(t|y)).
>The result is that you might get
>docs you didn't expect i.e. docs without 't' (the term you were
>searching on) which is
>where this becomes valuable - it can find docs w/o any of your search
>terms as you search terms
>in a sense imply that other words are relevant.
>
>I have 2 test programs:
>[1] Given a pair of words and an index this app computes P(t|y)
>[2] Given one "starting word" (i.e. 't') this one computes P(t|y) for
>all other terms
>
>#1 runs fast enough while #2 is similar to what would be done in
>production, and takes
>forever on a large index (several hours on my 1GB/3Million record index
>of the dmoz.org content).
>
>Now for the questions - is the above generally right i.e.
>(a) my descr of how to calculate P(t|y)
>(b) my descr of the need to precalc P(t|y) for all pairs of words in the
>index, and thus
>it must take forever to do this in a large index [(b1) does Autonomy do
>this?].
>
>And the one other thing - I'm still checking my work but in some cases
>P(t|y) is > 1.0, which
>seems to make no sense as probabilities must be between 0 and 1 , but in
>the equation of
>P(t) is high and P(y) is low then maybe this "makes sense" and thus the
>equation is not always in
>the [0...1] range.
>
>I'm also interested in more, not-too-technical info on Bayesian search.
>Various docs I found  on citeseer.com
>where not what I was looking for.
>
>thx,
>  Dave
>
>PS
>  I hope I hit the relevant mailing list - sorry if this is outside the
>dev charter.
>
>
>PPS
>
>Here's the output of my test program when I asked it to calcluate P(t|y)
>for t='linux'
>going over index of dmoz (index contains title+desc). It seem to make
>some sense - the
>case I was interested in was y=penguin (the linux mascot) which ranks in
>at 65%, thus
>if you searched on linux you might be (pleasantly?) suprised to get docs
>with pictures
>of baby penguins or whatnot. Output is all words < 100% and > 50%. I
>have not audited
>this extensively, esp as to why ayuda is so high(means help in
>Spanish..). Might be that
>small #'s of matches gives scores too high.
>
>
>
>          99.42% ayuda
>          99.42% customers
>          99.42% packet
>          92.36% disk
>          92.06% caen
>          92.06% gopher
>          92.06% kiosk
>          92.06% setting
>          89.79% beginner
>          89.54% rexx
>          89.54% soluciones
>          88.68% columns
>          88.68% input
>          88.68% lining
>          88.68% tyneside
>          87.23% anesthesia
>          87.23% fortran
>          85.49% alternativa
>          85.49% cease
>          85.49% iomega
>          85.49% linus
>          85.49% respect
>          85.49% rh
>          85.49% streams
>          82.46% beehive
>          82.46% mastering
>          82.46% techtv
>          80.81% demos
>          80.81% fs
>          80.78% sybase
>          80.17% tk
>          79.59% bizkaia
>          79.59% delaney
>          79.59% lilac
>          79.59% millions
>          78.83% todos
>          76.87% avi
>          76.87% centurion
>          76.87% emirates
>          76.87% ett
>          76.87% fidonet
>          76.87% libri
>          76.87% taken
>          76.87% zmir
>          75.08% monterrey
>          74.57% tipps
>          74.29% clic
>          74.29% forensics
>          74.29% rampage
>          74.29% wr
>          73.80% modem
>          71.83% ee
>          71.83% freiburger
>          71.83% mic
>          71.83% poder
>          71.80% plug
>          71.58% apps
>          70.37% joins
>          69.93% pluto
>          69.70% unix
>          69.50% azur
>          69.50% dilemma
>          69.50% lis
>          69.50% meal
>          69.50% yhdistys
>          68.33% wan
>          67.27% balancing
>          67.27% diverse
>          67.27% embrace
>          67.27% finances
>          67.27% processors
>          67.08% drivers
>          65.97% cluster
>          65.30% penguin
>          65.29% proxy
>          65.15% faithful
>          65.15% macs
>          65.15% touts
>          65.12% hat
>          64.14% kde
>          63.85% tuxedo
>          63.13% feds
>          63.13% handicapped
>          63.13% ids
>          63.13% waikato
>          61.21% improving
>          61.21% inventor
>          61.21% spiele
>          61.21% stavanger
>          60.28% powered
>          59.84% tome
>          59.80% checklist
>          59.37% announces
>          59.37% bursa
>          59.37% cali
>          59.37% chennai
>          59.37% curve
>          59.37% washtenaw
>          59.16% beginners
>          58.53% nutshell
>          58.48% benchmark
>          57.61% beos
>          57.61% modules
>          57.61% nella
>          57.61% shut
>          55.93% alford
>          55.93% australien
>          55.93% promoting
>          55.93% tivoli
>          55.54% parallel
>          54.52% enthusiasts
>          54.32% choosing
>          54.32% pretoria
>          54.32% scheduling
>          54.32% schwabach
>          54.32% szczecin
>          54.32% unicode
>          53.85% intranet
>          52.99% backup
>          52.78% latvian
>          52.78% opengl
>          52.78% processor
>          52.78% vendors
>          52.12% tcp
>          51.30% longer
>          51.30% weinheim
>
>
>

---------------------------------------------------------------------------
XML Topic Maps: Creating and Using Topic Maps for the Web.