Return-Path: Delivered-To: apmail-jakarta-lucene-dev-archive@apache.org Received: (qmail 62167 invoked from network); 19 Sep 2002 18:54:20 -0000 Received: from unknown (HELO nagoya.betaversion.org) (192.18.49.131) by daedalus.apache.org with SMTP; 19 Sep 2002 18:54:20 -0000 Received: (qmail 4801 invoked by uid 97); 19 Sep 2002 18:54:59 -0000 Delivered-To: qmlist-jakarta-archive-lucene-dev@jakarta.apache.org Received: (qmail 4620 invoked by uid 97); 19 Sep 2002 18:54:58 -0000 Mailing-List: contact lucene-dev-help@jakarta.apache.org; run by ezmlm Precedence: bulk List-Unsubscribe: List-Subscribe: List-Help: List-Post: List-Id: "Lucene Developers List" Reply-To: "Lucene Developers List" Delivered-To: mailing list lucene-dev@jakarta.apache.org Received: (qmail 4585 invoked by uid 98); 19 Sep 2002 18:54:57 -0000 X-Antivirus: nagoya (v4218 created Aug 14 2002) X-MimeOLE: Produced By Microsoft Exchange V6.0.4712.0 content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="----_=_NextPart_001_01C2600D.F2569704" Subject: Does it make sense to add Bayesian search to lucent? Date: Thu, 19 Sep 2002 11:54:13 -0700 Message-ID: <728DA21B8941A843A7C496F1ACF485185AD33A@gleam.lumos.com> X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: Does it make sense to add Bayesian search to lucent? Thread-Index: AcJgDfJJGNcayhQ3TRqiNaP71BsOqg== From: "Spencer, Dave" To: X-Spam-Rating: daedalus.apache.org 1.6.2 0/1000/N X-Spam-Rating: daedalus.apache.org 1.6.2 0/1000/N ------_=_NextPart_001_01C2600D.F2569704 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable I'm not convinced I understand Bayesian search/logic/probabilities. I thought this was a great article about the company Autonomy in Wired and it gives a kind of technical intro to it and surprisingly for such magazines, there's even a math equation it in(!). =20 http://www.wired.com/wired/archive/8.02/autonomy_pr.html =20 It gives what I believe is a simplified version of Baye's theorem: =20 P(t|y) =3D [ P(y|t)P(t) ] / P(y) =20 P(t|y) is the probability that 't' implies 'y'. P(t) is the probability of 't' happening. =20 In the context of a search engine I think 't' and 'y' are two different words - the equation says that if you're searching for 't', the probability that 'y' is relevant is, well, the right side of the equation. =20 To calculate the right side you do the following: =20 P(t) is the prob of 't' occurring in a document, thus searching with the query "+t" and calling Hits.length() returns this (well, you divide the the # of docs in the collection IndexReader.numDocs() to get the probability). =20 P(y) is done the same way. =20 P(y|t), I believe, is done as follows. Is it the prob of both words occurring divided by the number of docs that y occurs, so it's the count for "+y +t" divided by the count for "+y". =20 Then voilla, you can calculate it out. =20 I believe then that the way Lucene can be used to support this kind of logic is once the index is built go thru all terms (IndexReader.terms()) and compute P(t|y) for every pair of words indexed. =20 Then if someone queries on 't' you do something similar to what a fuzzy search does - you form a larger query and boost the other words based on their probability (i.e. P(t|y)).=20 The result is that you might get docs you didn't expect i.e. docs without 't' (the term you were searching on) which is where this becomes valuable - it can find docs w/o any of your search terms as you search terms in a sense imply that other words are relevant. =20 I have 2 test programs: [1] Given a pair of words and an index this app computes P(t|y) [2] Given one "starting word" (i.e. 't') this one computes P(t|y) for all other terms =20 #1 runs fast enough while #2 is similar to what would be done in production, and takes forever on a large index (several hours on my 1GB/3Million record index of the dmoz.org content). =20 Now for the questions - is the above generally right i.e.=20 (a) my descr of how to calculate P(t|y) (b) my descr of the need to precalc P(t|y) for all pairs of words in the index, and thus it must take forever to do this in a large index [(b1) does Autonomy do this?]. =20 And the one other thing - I'm still checking my work but in some cases P(t|y) is > 1.0, which seems to make no sense as probabilities must be between 0 and 1 , but in the equation of P(t) is high and P(y) is low then maybe this "makes sense" and thus the equation is not always in=20 the [0...1] range. =20 I'm also interested in more, not-too-technical info on Bayesian search. Various docs I found on citeseer.com where not what I was looking for. =20 thx,=20 Dave =20 PS=20 I hope I hit the relevant mailing list - sorry if this is outside the dev charter. =20 =20 PPS =20 Here's the output of my test program when I asked it to calcluate P(t|y) for t=3D'linux' going over index of dmoz (index contains title+desc). It seem to make some sense - the case I was interested in was y=3Dpenguin (the linux mascot) which ranks = in at 65%, thus if you searched on linux you might be (pleasantly?) suprised to get docs with pictures of baby penguins or whatnot. Output is all words < 100% and > 50%. I have not audited this extensively, esp as to why ayuda is so high(means help in Spanish..). Might be that small #'s of matches gives scores too high. =20 =20 =20 99.42% ayuda 99.42% customers 99.42% packet 92.36% disk 92.06% caen 92.06% gopher 92.06% kiosk 92.06% setting 89.79% beginner 89.54% rexx 89.54% soluciones 88.68% buscador 88.68% columns 88.68% input 88.68% lining 88.68% tyneside 87.23% anesthesia 87.23% fortran 85.49% alternativa 85.49% cease 85.49% iomega 85.49% linus 85.49% respect 85.49% rh 85.49% streams 82.46% beehive 82.46% mastering 82.46% techtv 80.81% demos 80.81% fs 80.78% sybase 80.17% tk 79.59% bizkaia 79.59% delaney 79.59% lilac 79.59% millions 78.83% todos 76.87% avi 76.87% centurion 76.87% emirates 76.87% ett 76.87% fidonet 76.87% libri 76.87% taken 76.87% zmir 75.08% monterrey 74.57% tipps 74.29% clic 74.29% forensics 74.29% rampage 74.29% wr 73.80% modem 71.83% ee 71.83% freiburger 71.83% mic 71.83% poder 71.80% plug 71.58% apps 70.37% joins 69.93% pluto 69.70% unix 69.50% azur 69.50% dilemma 69.50% lis 69.50% meal 69.50% yhdistys 68.33% wan 67.27% balancing 67.27% broadcasters 67.27% diverse 67.27% embrace 67.27% finances 67.27% processors 67.08% drivers 65.97% cluster 65.30% penguin 65.29% proxy 65.15% faithful 65.15% macs 65.15% touts 65.12% hat 64.14% kde 64.13% administrators 63.85% tuxedo 63.13% feds 63.13% handicapped 63.13% ids 63.13% readies 63.13% waikato 61.21% improving 61.21% inventor 61.21% spiele 61.21% stavanger 60.28% powered 59.84% tome 59.80% checklist 59.37% announces 59.37% bursa 59.37% cali 59.37% chennai 59.37% curve 59.37% washtenaw 59.16% beginners 58.53% nutshell 58.48% benchmark 57.61% beos 57.61% modules 57.61% nella 57.61% shut 55.93% alford 55.93% australien 55.93% promoting 55.93% tivoli 55.54% parallel 54.52% enthusiasts 54.32% choosing 54.32% pretoria 54.32% scheduling 54.32% schwabach 54.32% szczecin 54.32% unicode 53.85% intranet 52.99% backup 52.78% latvian 52.78% opengl 52.78% processor 52.78% vendors 52.12% tcp 51.72% armada 51.30% longer 51.30% weinheim =20 =20 ------_=_NextPart_001_01C2600D.F2569704--