lucene-dev mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject RE: Does it make sense to add Bayesian search to Lucene?
Date Thu, 26 Sep 2002 22:22:40 GMT
I'd like to see all the code mentioned :)

Also, I recently read an article about a CBR tool called Selection
Engine in one of the recent JavaPro issues.  The project is on SF.net.
Doug mentioned refactoring Lucene to allow the similarity computation
to be more pluggable, I believe.
I wonder if that would allow you and others to plug in different
components for calculating similarity, like the CBR tool mentioned,
this Bayesian approach, etc.

Otis


--- "Spencer, Dave" <dave@lumos.com> wrote:
> Thanks for the info, sounds interesting.
> Yes, I'd like to see what you've done.
> I'll try to post my 2 Bayesian "test" programs + my dmoz.org indexer
> [the latter in case people want a 1 GB index to play with]. Will
> probably take a day or two to get this stuff together.
> 
> -----Original Message-----
> From: Maurits van Wijland [mailto:m.vanwijland@quicknet.nl]
> Sent: Thursday, September 19, 2002 12:59 PM
> To: Lucene Developers List
> Cc: Doug Cutting; Spencer, Dave
> Subject: Re: Does it make sense to add Bayesian search to Lucene?
> 
> 
> Dave,
> 
> this is very interesting. I have done something similar with Mutual
> Information statistics.
> The trick is to build a semantic network. First, you go through all
> the terms and perform a NEAR search within a preset window. Next you
> calculate the probability, and word pairs with a probability higher
> than 0.6 are added to the semantic network.
> 
> This network can then be referenced when performing a search. Simply
> look up the term in the semantic network and get all related words.
> Next, add them to the query and set the boost factors properly (for
> instance, boost the entered term by a factor of 2, and boost the
> others in relation to the probability of the word pair).
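> 
> Roughly, that lookup-and-boost step could look like the sketch below
> (untested; the class and method names are made up, and the network is
> assumed to have already been built as a map from a word to its related
> words and their pair probabilities):
> 
> import java.util.HashMap;
> import java.util.Map;
> 
> // Hypothetical sketch of the search-time lookup described above.
> public class SemanticExpander {
>     // word -> { related word -> pair probability (> 0.6) }
>     private final Map<String, Map<String, Double>> network;
> 
>     public SemanticExpander(Map<String, Map<String, Double>> network) {
>         this.network = network;
>     }
> 
>     // Returns the query terms mapped to boost factors: the entered term
>     // is boosted by 2, related terms by their pair probability.
>     public Map<String, Double> expand(String enteredTerm) {
>         Map<String, Double> boosts = new HashMap<String, Double>();
>         boosts.put(enteredTerm, 2.0);
>         Map<String, Double> related = network.get(enteredTerm);
>         if (related != null) {
>             boosts.putAll(related);
>         }
>         return boosts;
>     }
> }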
> 
> I believe that for documents within a fixed domain, the semantic
> network only needs to be rebuilt when a substantial number of
> documents is added to the collection.
> 
> Maybe we can swap some code, or, in case anyone else is interested, I
> can submit my Mutual Information code to the list? But it is based on
> my Dutch Analyzer and Dutch stopwords.
> 
> Doug, you have much experience in this area of statistics and
> clustering; I recall some documents on Scatter/Gather. Is Lucene going
> this way, or will Lucene remain a free-text search engine?
> What are the plans for the near future of Lucene?
> 
> regards,
> 
> maurits van wijland
> 
> 
> ----- Original Message -----
> From: "Spencer, Dave" <dave@lumos.com>
> To: <lucene-dev@jakarta.apache.org>
> Sent: Thursday, September 19, 2002 8:54 PM
> Subject: Does it make sense to add Bayesian search to Lucene?
> 
> 
> I'm not convinced I understand Bayesian search/logic/probabilities.
> I thought this was a great article about the company Autonomy in
> Wired; it gives a kind of technical intro to the subject and,
> surprisingly for such a magazine, there's even a math equation in
> it (!).
> 
> http://www.wired.com/wired/archive/8.02/autonomy_pr.html
> 
> It gives what I believe is a simplified version of Bayes' theorem:
> 
> P(t|y) = [ P(y|t)P(t) ] / P(y)
> 
> P(t|y) is the probability that 't' implies 'y'.
> P(t) is the probability of 't' happening.
> 
> In the context of a search engine, I think 't' and 'y' are two
> different words - the equation says that if you're searching for 't',
> the probability that 'y' is relevant is, well, the right side of the
> equation.
> 
> To calculate the right side you do the following:
> 
> P(t) is the prob of 't' occurring in a document: searching with the
> query "+t" and calling Hits.length() returns this (well, you divide by
> the # of docs in the collection, IndexReader.numDocs(), to get the
> probability).
> 
> P(y) is done the same way.
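> 
> Concretely, a minimal (untested) sketch of that estimate, assuming the
> text was indexed in a field called "contents":
> 
> import java.io.IOException;
> 
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.search.Hits;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.TermQuery;
> 
> // P(t) estimated as (docs containing t) / (total docs), exactly as
> // described above.  The field name "contents" is an assumption.
> public class TermProb {
>     public static double prob(IndexSearcher searcher, IndexReader reader,
>                               String word) throws IOException {
>         Hits hits = searcher.search(new TermQuery(new Term("contents", word)));
>         return (double) hits.length() / reader.numDocs();
>     }
> }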
> 
> P(y|t), I believe, is done as follows: it is the prob of both words
> occurring divided by the number of docs in which 'y' occurs, so it's
> the count for "+y +t" divided by the count for "+y".
> 
> Then voilà, you can calculate it out.
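> 
> Putting the pieces together, the right-hand side could be computed
> roughly like this (untested sketch; the "contents" field name is an
> assumption, and the counts are taken exactly as described above):
> 
> import java.io.IOException;
> 
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.search.BooleanQuery;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.TermQuery;
> 
> // Untested sketch of P(t|y) = P(y|t) * P(t) / P(y).
> public class Bayes {
>     static int count(IndexSearcher s, Query q) throws IOException {
>         return s.search(q).length();          // number of matching docs
>     }
> 
>     static Query both(String a, String b) {   // the "+a +b" query
>         BooleanQuery q = new BooleanQuery();
>         q.add(new TermQuery(new Term("contents", a)), true, false);
>         q.add(new TermQuery(new Term("contents", b)), true, false);
>         return q;
>     }
> 
>     public static double pTgivenY(IndexSearcher s, IndexReader r,
>                                   String t, String y) throws IOException {
>         int n = r.numDocs();
>         int countT = count(s, new TermQuery(new Term("contents", t)));
>         int countY = count(s, new TermQuery(new Term("contents", y)));
>         int countBoth = count(s, both(y, t));
>         if (countT == 0 || countY == 0) return 0.0;
>         double pT = (double) countT / n;
>         double pY = (double) countY / n;
>         double pYgivenT = (double) countBoth / countY;  // "+y +t" / "+y", as above
>         return pYgivenT * pT / pY;
>     }
> }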
> 
> I believe the way Lucene can be used to support this kind of logic
> is: once the index is built, go through all terms (IndexReader.terms())
> and compute P(t|y) for every pair of words indexed.
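> 
> That precomputation pass might be sketched (untested) as a double loop
> over the term dictionary, calling a P(t|y) routine like the one above
> for each pair - which is O(V^2) searches and explains why it takes so
> long:
> 
> import java.io.IOException;
> 
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.TermEnum;
> import org.apache.lucene.search.IndexSearcher;
> 
> // Untested sketch: walk the term dictionary twice and compute P(t|y)
> // for every pair.  In practice you would filter to one field and only
> // keep pairs whose probability is above some cutoff.
> public class PrecomputePairs {
>     public static void run(IndexReader reader, IndexSearcher searcher)
>             throws IOException {
>         TermEnum outer = reader.terms();
>         while (outer.next()) {
>             String t = outer.term().text();
>             TermEnum inner = reader.terms();
>             while (inner.next()) {
>                 String y = inner.term().text();
>                 if (t.equals(y)) continue;
>                 double p = Bayes.pTgivenY(searcher, reader, t, y);
>                 // store (t, y, p) somewhere, e.g. on disk, if p is big enough
>             }
>             inner.close();
>         }
>         outer.close();
>     }
> }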
> 
> Then, if someone queries on 't', you do something similar to what a
> fuzzy search does - you form a larger query and boost the other words
> based on their probability (i.e. P(t|y)).
> The result is that you might get docs you didn't expect, i.e. docs
> without 't' (the term you were searching on), which is where this
> becomes valuable - it can find docs without any of your search terms,
> as your search terms in a sense imply that other words are relevant.
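> 
> The expanded query itself might be built along these lines (untested
> sketch; the "contents" field and the precomputed "related" map of
> word -> P(t|y) are assumptions):
> 
> import java.util.Map;
> 
> import org.apache.lucene.index.Term;
> import org.apache.lucene.search.BooleanQuery;
> import org.apache.lucene.search.TermQuery;
> 
> // Untested sketch of the expanded query: the original term plus the
> // related words, each added as an optional clause boosted by P(t|y),
> // so that docs containing none of the original terms can still match.
> public class ExpandQuery {
>     public static BooleanQuery expand(String t, Map<String, Double> related) {
>         BooleanQuery q = new BooleanQuery();
>         q.add(new TermQuery(new Term("contents", t)), false, false);  // optional
>         for (Map.Entry<String, Double> e : related.entrySet()) {
>             TermQuery tq = new TermQuery(new Term("contents", e.getKey()));
>             tq.setBoost(e.getValue().floatValue());                   // boost by P(t|y)
>             q.add(tq, false, false);                                  // optional
>         }
>         return q;
>     }
> }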
> 
> I have 2 test programs:
> [1] Given a pair of words and an index, this app computes P(t|y)
> [2] Given one "starting word" (i.e. 't'), this one computes P(t|y) for
> all other terms
> 
> #1 runs fast enough, while #2 is similar to what would be done in
> production and takes forever on a large index (several hours on my
> 1 GB / 3-million-record index of the dmoz.org content).
> 
> Now for the questions - is the above generally right? I.e.:
> (a) my description of how to calculate P(t|y)
> (b) my description of the need to precalculate P(t|y) for all pairs of
> words in the index, and thus that it must take forever to do this on a
> large index [(b1) does Autonomy do this?].
> 
> And one other thing - I'm still checking my work, but in some cases
> P(t|y) is > 1.0, which seems to make no sense, as probabilities must
> be between 0 and 1. But if, in the equation, P(t) is high and P(y) is
> low, then maybe this "makes sense", and thus the equation is not
> always in the [0...1] range.
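> 
> For instance, with made-up counts - 1,000 docs in total, 't' in 500 of
> them, 'y' in 10, and both together in 5 - the above gives P(t) = 0.5,
> P(y) = 0.01, and P(y|t) estimated as 5/10 = 0.5, so
> P(t|y) = 0.5 * 0.5 / 0.01 = 25, well outside [0...1].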
> 
> I'm also interested in more, not-too-technical info on Bayesian
> search. Various docs I found on citeseer.com were not what I was
> looking for.
> 
> thx,
>  Dave
> 
> PS
> I hope I hit the relevant mailing list - sorry if this is outside
> the dev charter.
> 
> 
> PPS
> 
> Here's the output of my test program when I asked it to calculate
> P(t|y) for t='linux', going over the index of dmoz (the index contains
> title+desc). It seems to make some sense - the case I was interested
> in was y=penguin (the Linux mascot), which ranks in at 65%; thus, if
> you searched on linux, you might be (pleasantly?) surprised to get
> docs with pictures of baby penguins or whatnot. Output is all words
> < 100% and > 50%. I have not audited this extensively, especially as
> to why ayuda is so high (it means help in Spanish...). It might be
> that small #'s of matches give scores that are too high.
> 
> 
> 
>          99.42% ayuda
>          99.42% customers
>          99.42% packet
>          92.36% disk
>          92.06% caen
>          92.06% gopher
>          92.06% kiosk
>          92.06% setting
>          89.79% beginner
>          89.54% rexx
>          89.54% soluciones
>          88.68% buscador
>          88.68% columns
>          88.68% input
>          88.68% lining
>          88.68% tyneside
>          87.23% anesthesia
>          87.23% fortran
>          85.49% alternativa
>          85.49% cease
>          85.49% iomega
>          85.49% linus
>          85.49% respect
>          85.49% rh
>          85.49% streams
>          82.46% beehive
>          82.46% mastering
>          82.46% techtv
>          80.81% demos
>          80.81% fs
>          80.78% sybase
>          80.17% tk
>          79.59% bizkaia
>          79.59% delaney
>          79.59% lilac
>          79.59% millions
>          78.83% todos
>          76.87% avi
>          76.87% centurion
>          76.87% emirates
>          76.87% ett
>          76.87% fidonet
>          76.87% libri
>          76.87% taken
>          76.87% zmir
>          75.08% monterrey
>          74.57% tipps
>          74.29% clic
>          74.29% forensics
>          74.29% rampage
>          74.29% wr
>          73.80% modem
>          71.83% ee
>          71.83% freiburger
>          71.83% mic
>          71.83% poder
>          71.80% plug
>          71.58% apps
>          70.37% joins
>          69.93% pluto
>          69.70% unix
>          69.50% azur
>          69.50% dilemma
>          69.50% lis
>          69.50% meal
>          69.50% yhdistys
>          68.33% wan
>          67.27% balancing
>          67.27% broadcasters
>          67.27% diverse
>          67.27% embrace
>          67.27% finances
>          67.27% processors
>          67.08% drivers
>          65.97% cluster
>          65.30% penguin
>          65.29% proxy
>          65.15% faithful
>          65.15% macs
>          65.15% touts
>          65.12% hat
>          64.14% kde
>          64.13% administrators
>          63.85% tuxedo
>          63.13% feds
>          63.13% handicapped
>          63.13% ids
>          63.13% readies
>          63.13% waikato
>          61.21% improving
>          61.21% inventor
>          61.21% spiele
>          61.21% stavanger
>          60.28% powered
>          59.84% tome
>          59.80% checklist
>          59.37% announces
>          59.37% bursa
>          59.37% cali
>          59.37% chennai
>          59.37% curve
>          59.37% washtenaw
>          59.16% beginners
>          58.53% nutshell
>          58.48% benchmark
>          57.61% beos
>          57.61% modules
>          57.61% nella
>          57.61% shut
>          55.93% alford
>          55.93% australien
>          55.93% promoting
>          55.93% tivoli
>          55.54% parallel
>          54.52% enthusiasts
>          54.32% choosing
>          54.32% pretoria
>          54.32% scheduling
>          54.32% schwabach
>          54.32% szczecin
>          54.32% unicode
>          53.85% intranet
>          52.99% backup
>          52.78% latvian
>          52.78% opengl
>          52.78% processor
>          52.78% vendors
>          52.12% tcp
>          51.72% armada
>          51.30% longer
>          51.30% weinheim
> 