My response to your question "did I reach the right list?" is: Yes. I've
mentioned Latent Semantic Analysis on this list before. Bayes, I think, is
just as valid and important. I am personally eager to see something like
Lucene serve a variety of purposes in knowledge space.
FWIW, in order to avoid the one-size-fits-all bloated application that
Lucene could easily become, I am wondering if there is a way to perform a
major refactoring of the Lucene concept into a society of agents, all
coordinated by, say, tuplespace.
Just a few EUROs...
Jack
At 11:54 AM 9/19/2002 -0700, you wrote:
>I'm not convinced I understand Bayesian search/logic/probabilities.
>I thought this was a great article about the company Autonomy in Wired
>and it gives a kind of technical intro to it and surprisingly for such
>magazines, there's even a math equation it in(!).
>
>http://www.wired.com/wired/archive/8.02/autonomy_pr.html
>
>It gives what I believe is a simplified version of Baye's theorem:
>
>P(t|y) = [ P(y|t)P(t) ] / P(y)
>
>P(t|y) is the probability that 't' implies 'y'.
>P(t) is the probability of 't' happening.
>
>In the context of a search engine I think 't' and 'y' are
>two different words - the equation says that if you're searching
>for 't', the probability that 'y' is relevant is, well, the right side
>of the equation.
>
>To calculate the right side you do the following:
>
>P(t) is the prob of 't' occurring in a document, thus searching with the
>query "+t" and
>calling Hits.length() returns this (well, you divide the the # of docs
>in the collection
>IndexReader.numDocs() to get the probability).
>
>P(y) is done the same way.
>
>P(y|t), I believe, is done as follows. Is it the prob of both words
>occurring
>divided by the number of docs that y occurs, so it's the count for "+y
>+t" divided
>by the count for "+y".
>
>Then voilla, you can calculate it out.
>
>I believe then that the way Lucene can be used to support this kind of
>logic is
>once the index is built go thru all terms (IndexReader.terms()) and
>compute
>P(t|y) for every pair of words indexed.
>
>Then if someone queries on 't' you do something similar to what a fuzzy
>search does - you
>form a larger query and boost the other words based on their probability
>(i.e. P(t|y)).
>The result is that you might get
>docs you didn't expect i.e. docs without 't' (the term you were
>searching on) which is
>where this becomes valuable - it can find docs w/o any of your search
>terms as you search terms
>in a sense imply that other words are relevant.
>
>I have 2 test programs:
>[1] Given a pair of words and an index this app computes P(t|y)
>[2] Given one "starting word" (i.e. 't') this one computes P(t|y) for
>all other terms
>
>#1 runs fast enough while #2 is similar to what would be done in
>production, and takes
>forever on a large index (several hours on my 1GB/3Million record index
>of the dmoz.org content).
>
>Now for the questions - is the above generally right i.e.
>(a) my descr of how to calculate P(t|y)
>(b) my descr of the need to precalc P(t|y) for all pairs of words in the
>index, and thus
>it must take forever to do this in a large index [(b1) does Autonomy do
>this?].
>
>And the one other thing - I'm still checking my work but in some cases
>P(t|y) is > 1.0, which
>seems to make no sense as probabilities must be between 0 and 1 , but in
>the equation of
>P(t) is high and P(y) is low then maybe this "makes sense" and thus the
>equation is not always in
>the [0...1] range.
>
>I'm also interested in more, not-too-technical info on Bayesian search.
>Various docs I found on citeseer.com
>where not what I was looking for.
>
>thx,
> Dave
>
>PS
> I hope I hit the relevant mailing list - sorry if this is outside the
>dev charter.
>
>
>PPS
>
>Here's the output of my test program when I asked it to calcluate P(t|y)
>for t='linux'
>going over index of dmoz (index contains title+desc). It seem to make
>some sense - the
>case I was interested in was y=penguin (the linux mascot) which ranks in
>at 65%, thus
>if you searched on linux you might be (pleasantly?) suprised to get docs
>with pictures
>of baby penguins or whatnot. Output is all words < 100% and > 50%. I
>have not audited
>this extensively, esp as to why ayuda is so high(means help in
>Spanish..). Might be that
>small #'s of matches gives scores too high.
>
>
>
> 99.42% ayuda
> 99.42% customers
> 99.42% packet
> 92.36% disk
> 92.06% caen
> 92.06% gopher
> 92.06% kiosk
> 92.06% setting
> 89.79% beginner
> 89.54% rexx
> 89.54% soluciones
> 88.68% buscador
> 88.68% columns
> 88.68% input
> 88.68% lining
> 88.68% tyneside
> 87.23% anesthesia
> 87.23% fortran
> 85.49% alternativa
> 85.49% cease
> 85.49% iomega
> 85.49% linus
> 85.49% respect
> 85.49% rh
> 85.49% streams
> 82.46% beehive
> 82.46% mastering
> 82.46% techtv
> 80.81% demos
> 80.81% fs
> 80.78% sybase
> 80.17% tk
> 79.59% bizkaia
> 79.59% delaney
> 79.59% lilac
> 79.59% millions
> 78.83% todos
> 76.87% avi
> 76.87% centurion
> 76.87% emirates
> 76.87% ett
> 76.87% fidonet
> 76.87% libri
> 76.87% taken
> 76.87% zmir
> 75.08% monterrey
> 74.57% tipps
> 74.29% clic
> 74.29% forensics
> 74.29% rampage
> 74.29% wr
> 73.80% modem
> 71.83% ee
> 71.83% freiburger
> 71.83% mic
> 71.83% poder
> 71.80% plug
> 71.58% apps
> 70.37% joins
> 69.93% pluto
> 69.70% unix
> 69.50% azur
> 69.50% dilemma
> 69.50% lis
> 69.50% meal
> 69.50% yhdistys
> 68.33% wan
> 67.27% balancing
> 67.27% broadcasters
> 67.27% diverse
> 67.27% embrace
> 67.27% finances
> 67.27% processors
> 67.08% drivers
> 65.97% cluster
> 65.30% penguin
> 65.29% proxy
> 65.15% faithful
> 65.15% macs
> 65.15% touts
> 65.12% hat
> 64.14% kde
> 64.13% administrators
> 63.85% tuxedo
> 63.13% feds
> 63.13% handicapped
> 63.13% ids
> 63.13% readies
> 63.13% waikato
> 61.21% improving
> 61.21% inventor
> 61.21% spiele
> 61.21% stavanger
> 60.28% powered
> 59.84% tome
> 59.80% checklist
> 59.37% announces
> 59.37% bursa
> 59.37% cali
> 59.37% chennai
> 59.37% curve
> 59.37% washtenaw
> 59.16% beginners
> 58.53% nutshell
> 58.48% benchmark
> 57.61% beos
> 57.61% modules
> 57.61% nella
> 57.61% shut
> 55.93% alford
> 55.93% australien
> 55.93% promoting
> 55.93% tivoli
> 55.54% parallel
> 54.52% enthusiasts
> 54.32% choosing
> 54.32% pretoria
> 54.32% scheduling
> 54.32% schwabach
> 54.32% szczecin
> 54.32% unicode
> 53.85% intranet
> 52.99% backup
> 52.78% latvian
> 52.78% opengl
> 52.78% processor
> 52.78% vendors
> 52.12% tcp
> 51.72% armada
> 51.30% longer
> 51.30% weinheim
>
>
>
---------------------------------------------------------------------------
XML Topic Maps: Creating and Using Topic Maps for the Web.
Addison-Wesley, ISBN 0-201-74960-2.
http://www.nexist.org/wiki/User0Blog
--
To unsubscribe, e-mail:
For additional commands, e-mail: